knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Attach libraries used for this tutorial

library(STAT302package)
library(class)
library(palmerpenguins)
library(randomForest)
library(magrittr)
library(dplyr)
library(gapminder)
library(tibble)
library(ggplot2)

1. Introduction

This is the final project for STAT302. A package called "STAT302package" is created, and it includes four functions: my_t_test, my_lm, my_knn_cv, and `my_rf_cv``. package collaborators: Cherry Pan & Yi-Hsuan Wu.

The guide to install the package:

devtools::install_github("Cherry-ty-Pan/STAT302package", build_vignette = TRUE, build_opts = c())
library(STAT302package)
# Use this to view the vignette in the Demo HTML help
help(package = "STAT302package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "STAT302package")

2. Tutorial for my_t.test

Use the lifeExp data from my_gapminder for the following tests

Case I: alternative = "two.sided"

\begin{align} H_0: \mu &= 60,\ H_a: \mu &\neq 60. \end{align}

# perform t-test with the alternative = two.sided
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")

In this case, p-value is 0.093, which is greater than 0.05. Thus the result is not statistically significant. We don't have enough evidence to reject the null hypothesis.

Case II: alternative = "less"

\begin{align} H_0: \mu &= 60,\ H_a: \mu &< 60. \end{align}

# perform t-test with the alternative = less
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "less")

In this case, p-value is 0.047, which is less than 0.05. Thus the result is statistically significant. We do have enough evidence to reject the null hypothesis.

case III: alternative = "greater"

\begin{align} H_0: \mu &= 60,\ H_a: \mu &> 60. \end{align}

# perform t-test with the alternative = greater
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "greater")

In this case, p-value is 0.0953, which is greater than 0.05. Thus the result is not statistically significant. We don't have enough evidence to reject the null hypothesis.

3. Tutorial for my_lm

# a regression using `lifeExp` as the response variable and `gdpPercap` and
# `continent` as explanatory variables
reg <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
reg
coef <- reg$Estimate[, 1]
matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# store actual values and fitted values to a data frame
df <- data.frame("actual" = my_gapminder$lifeExp, 
                 "fitted" = matrix %*% as.matrix(coef), 
                 "continent" = my_gapminder$continent)
# plot the Actual vs. Fitted values graph
ggplot(df, aes(x = actual, y = fitted, color = continent)) +
   geom_point() +
   theme_bw(base_size = 15) +
   labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") + 
   theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1))

After comparing the coefficients of different continents, we can tell that the continent has more influence on lifeExp than gdpPercap

Here, the gdpPercap coefficient is 4.452704e-04, meaning that the expected difference in the lifeExp between two observations differing by one unit in gdpPercap is 4.452704e-04, with all other covariates identical.

We will do a two-sided hypothesis test associated with the gdpPercap coefficient. The null hypothesis is that gdpPercap has no significant influence on lifeExp.

\begin{align} H_0: \beta &= 0,\ H_a: \beta &\neq 0. \end{align}

The p-value for the hypothesis test is 8.553e-73, which is much smaller than 0.05. Thus the result is statistically significant and we have sufficient evidence to reject the null hypothesis. gdpPercap has significant influence on lifeExp.

4. Tutorial for my_knn_cv using my_penguins

# Predict output class species using covariates `bill_length_mm`, `bill_depth_mm`, 
# `flipper_length_mm`, and `body_mass_g`.
# omit all the NA in my_penguins
my_penguins <- na.omit(my_penguins)
train_err <- rep(NA, 10)
cv_err <- rep(NA, 10)
name <- c("k_nn", "Training error", "CV Error")
# store training errors and CV errors
for (i in 1:10) {
  test <- my_knn_cv(my_penguins[, 3:6], my_penguins$species, 1, 5)
  train_err[i] <- sum(my_penguins$species != test[[1]]) / nrow(my_penguins)
  cv_err[i] <- test[[2]]
}
# record the training error and CV error
record <- data.frame("k_nn" = c(1:10), 
                     "Training Error" = train_err, 
                     "CV Error" = cv_err)
record

Based on the table, I will use "k_nn = 1" model because "k_nn = 1" model has the least error rate in both training misclassification rate and error misclassification rate. Cross-validation is a model validation technique that assess how the results will generalize to an independent data set. In practice, we want to choose the model that generates error as less as possible.

5. Tutorial for my_rf_cv

# Predict `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and 
# `flipper_length_mm` with `my_knn_cv` function and iterate through k in c(2, 5, 10)  
# for 30 iterations each.
# Save the iteration results as `cv_2`, `cv_5` and `cv_10` respectively.
cv_2 <- replicate(30, my_rf_cv(2))
cv_5 <- replicate(30, my_rf_cv(5))
cv_10 <- replicate(30, my_rf_cv(10))
# Store results as a data frame
cv_df <- data.frame("k" = c(rep(2,30),rep(5,30), rep(10,30)),
                 "cv" = c(cv_2,cv_5,cv_10))
# Plot the results as a boxplot
cv_plot <- ggplot(cv_df, aes(factor(k), cv)) +
  geom_boxplot() +
  labs(x = c("k"), y = c("CV estimated MSE"))
cv_plot
# Display the statistics of the results as a table
cv_table <- data.frame("k" = c(2, 5, 10),
                       "Mean" = c(mean(cv_2),mean(cv_5),mean(cv_10)),
                       "Standard Deviation" = c(sd(cv_2),sd(cv_5),sd(cv_10)))
cv_table

From the boxplot, it's observed that CV estimated MSE has greatest mean when k = 2, which is above 120000. However, by inspecting the table, we can see standard deviation of CV estimated MSE is greatest when k = 10. The reason may be due to that 2 folds yield greatest MSE and 5 folds yield greatest variance in this data set.



Cherry-ty-Pan/STAT302package documentation built on Dec. 17, 2021, 2 p.m.