knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(happypackage)
library(ggplot2)
library(class)
library(tidyverse)

Introduction

This package is generated for the courese STAT 302 as project 3 in University of Washington. The package includes four functions: my_t_test, my_lm, my_knn_cv, and my_rf_cv.
my_t_test performs a one sample t-test in R. my_lm fits a linear model in R. my_knn_cv performs a k-Nearest Neighbors Cross-Validation in R. my_rf_cv performs a Random Forest Cross-Validation in R. \n

Tutorial

1. my_t_test tutorial

alternative: two.sided

\begin{align} H_0: \mu &= 60,\ H_a: \mu &\neq 60. \end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")
p_val <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")$p_val

The p-value here is r p_val, which is larger than 0.05. So we cannot conclude that a significant difference exists, and we fail to reject the null hypothesis.

alternative: less

\begin{align} H_0: \mu &= 60,\ H_a: \mu &< 60. \end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")
p_val_2 <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")$p_val

The p-value here is r p_val_2, which is smaller than 0.05. So we can conclude that a significant difference exists, and we reject the null hypothesis.

alternative: greater

\begin{align} H_0: \mu &= 60,\ H_a: \mu &> 60. \end{align}

my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")
p_val_3 <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")$p_val

The p-value here is r p_val_3, which is larger than 0.05. So we cannot conclude that a significant difference exists, and we fail to reject the null hypothesis.

2. my_lm tutorial

# load gapminder data
data("my_gapminder")
# use my_lm to model
lm_test <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
summary(lm_test)

According to the summary, the p value is less than 0.05, so we reject the null hypothesis. Therefore, the slope of the linear regression line is other than 0.

# extract the intercepts
my_coef <- lm_test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, 
                          data = my_gapminder)
# calculate the fitted value 
y_hat <- my_matrix %*% as.matrix(my_coef)
# create data frame for the dot plot
lm_data <- data.frame("Actual" = my_gapminder$lifeExp,
                      "Fitted" = y_hat,
                      "Continent" = my_gapminder$continent)
# draw the dot plot
lm_plot <- ggplot(lm_data, aes(x = Actual, y = Fitted, color = Continent)) +
  geom_point() +
  theme_bw(base_size = 12) +
  labs(title = "Fitted life Expectancy VS Actual life Expectancy") + 
  theme(plot.title = element_text(hjust = 0.5,
                                  face = "bold",
                                  size = 12)) +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2)
lm_plot

According to the graph, we can see that the model fits all continents well, except Africa.
While Europe has the overall highest life expectancy, following by Americans and Asia.

3. my_knn_cv tutorial

# create an empty matrix to record the training misclassification rate and the CV misclassification rate
result <- matrix(NA, nrow = 2, ncol = 10)
# iterate from k_nn=1,…,10 and store the training misclassification rate and the CV misclassification rate
for(i in 1:10) {
  test_i <- my_knn_cv(my_gapminder, my_gapminder$continent, i, 5)
  result[1, i] <- test_i$cv_error
  accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
  tab_i <- table(test_i$class, my_gapminder$continent)
  training_set_error_k_nn <- round((100 - accuracy(tab_i)) / 100, digit = 2)
  result[2, i] <- training_set_error_k_nn
}
# provide row and column name for the matrix 
rownames(result) <- c("cv_err", "train_err")
colnames(result) <- paste("knn_", seq(1:10), sep = "")
# print the matrix
result

I would choose knn = 1 model because it has the smallest training misclassification rates.
I would choose knn = 10 model because it ahs the smallest CV misclassification rates. I would choose knn = 10 in practice because its difference between cv error and training error is smallest.

The process of cross-validation is that, it splits the data into k folds, and repeats choosing one fold as test data while others become training data. After each fold appeared as test data, it calculates the average of mean squared error.

It is useful because it makes the model more specific but not overfitted.

4. my_rf_cv tutorial

# create a matrix with 90 rows and 1 column 
cv_err_matrix <- matrix(NA, 90, 1)
# apply function my_rf_cv with k = 2, 5, 10 and store results into the matrix 
for(i in 1:30) {
  cvv_err_2 <- my_rf_cv(2)
  cv_err_matrix[i, 1] <- cvv_err_2
  cvv_err_5 <- my_rf_cv(5)
  cv_err_matrix[30 + i, 1] <- cvv_err_5
  cvv_err_10 <- my_rf_cv(10)
  cv_err_matrix[60 + i, 1] <- cvv_err_10
}
# create another matrix for identify the numnber of k for each data so that we can apply it for graphing the box plot later 
cv_err_type_matrix <- matrix(NA, 90, 1)
for(i in 1:30) {
  cv_err_type_matrix[i, 1] <- "k = 2"
  cv_err_type_matrix[30 + i, ] <- "k = 5"
  cv_err_type_matrix[60 + i, ] <- "k = 10"
}
# calculate the mean and standard deviation for all the CV estimate in each 30 simulations each for each k values
k_2_mean <- mean(cv_err_matrix[1:30, ])
k_2_sd <- sd(cv_err_matrix[1:30, ])
k_5_mean <- mean(cv_err_matrix[31:60, ])
k_5_sd <- sd(cv_err_matrix[31:60, ])
k_10_mean <- mean(cv_err_matrix[61:90, ])
k_10_sd <- sd(cv_err_matrix[61:90, ])
k_table_matrix <- matrix(c(k_2_mean, k_2_sd, k_5_mean, k_5_sd, k_10_mean, 
                           k_10_sd), ncol = 2, byrow = TRUE)
# provide row and column name for the matrix
rownames(k_table_matrix) <- c("k = 2", "k = 5", "k = 10")
colnames(k_table_matrix) <- c("mean", "sd")
# convert the matrix to a table and display the table
k_table <- as.table(k_table_matrix)
k_table
# create data frame for graphing the boxplots
cv_err_data_frame <- data.frame(cv_err = cv_err_matrix[, 1],
                                k_value = cv_err_type_matrix[, 1])
# graph boxplot for the CV errors for all k values used 
boxplot <- ggplot(data = cv_err_data_frame, 
       aes(x = reorder(k_value, cv_err, FUN = median), y = cv_err)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 12) +
  labs(title = "Cross validation error VS k value") +
  theme(plot.title = element_text(hjust = 0.5,
                                  face = "bold",
                                  size = 12)) + 
  theme(plot.title = element_text(hjust = 0.5)) +
  ylim(c(70, 85)) +
  xlab("k fold value") +
  ylab("Cross Validation Error")
boxplot

From the box plot, we can observe that as the k fold value gets bigger, the variance in cross validation error gets smaller, the standard deviations get smaller, and the median is increasing.



atlasyao/happypackage documentation built on March 23, 2020, 5:23 a.m.