knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

Introduction

Here, we introduce Stat302Project03, a data analysis package with the following functions: my_t.test, my_lm, my_knn_cv, and my_rf_cv, which are based on t.test(), lm(), k-nearest neighbors, and random forest, respectively.
We use gapminder as our example dataset.
Install Stat302Project03 using:

devtools::install_github("Xiaoying-Z/Stat302Project03", 
                         build_vignette = TRUE, build_opts = c())
library(Stat302Project03)
library(ggplot2)
library(kableExtra)

Tutorial of my_t.test

my_t.test(my_gapminder$lifeExp, "two.sided", 60)  # P = 0.09322877
my_t.test(my_gapminder$lifeExp, "less", 60)       # P = 0.04661438
my_t.test(my_gapminder$lifeExp, "greater", 60)    # P = 0.9533856

The first line of code evaluates the following hypothesis test:
$H_0: \mu = 60$
$H_a: \mu \neq 60$
With $\alpha = 0.05$ and $P = 0.0932$, the p-value is larger than $\alpha$, so we fail to reject $H_0$: the data do not provide sufficient evidence that the true mean differs from 60.
The second line of code evaluates the following hypothesis test:
$H_0: \mu = 60$
$H_a: \mu < 60$
With $\alpha = 0.05$ and $P = 0.0466$, we reject $H_0$ in favor of $H_a$: the true mean is less than 60. The third line of code evaluates the following hypothesis test:
$H_0: \mu = 60$
$H_a: \mu > 60$
With $\alpha = 0.05$ and $P = 0.953$, the p-value is much larger than $\alpha$, so we fail to reject $H_0$: the data do not provide evidence that the true mean is greater than 60.
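
To see where these numbers come from, below is a minimal sketch of the standard one-sample t-test that my_t.test is assumed to implement, with base R's t.test() as a check (the null value mu = 60 matches the tests above):

x <- my_gapminder$lifeExp
t_stat <- (mean(x) - 60) / (sd(x) / sqrt(length(x)))    # test statistic
df <- length(x) - 1                                     # degrees of freedom
2 * pt(abs(t_stat), df, lower.tail = FALSE)             # two-sided p-value
pt(t_stat, df, lower.tail = TRUE)                       # "less" p-value
pt(t_stat, df, lower.tail = FALSE)                      # "greater" p-value
t.test(x, mu = 60, alternative = "two.sided")$p.value   # base R comparison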

Tutorial of my_lm

test <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_coef <- test[, 1]  # estimated coefficients
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)  # fitted values

my_data <- data.frame("Actual" = my_gapminder$lifeExp,
                      "Fitted" = y_hat,
                      "Continent" = my_gapminder$continent)
ggplot(my_data, aes(x = Actual, y = Fitted, color = Continent)) + 
  geom_point() + theme_bw(base_size = 15) + 
  labs(title = "Actual vs. Fitted values",
       caption = "Graph 1") + 
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5)
)
coeff <- formatC(test$Estimate[2], digits = 3)
pval <- formatC(test$Pr...t..[2], digits = 3)  # Pr...t.. is the Pr(>|t|) column after name mangling

The coefficient of gdpPercap in the regression model is `r coeff`. This tells us that there is a positive but weak relationship between gdpPercap and lifeExp.
The p-value for the coefficient of gdpPercap is `r pval`, which is less than $\alpha = 0.05$, so we reject the null hypothesis that the coefficient of gdpPercap is $0$ in favor of the alternative hypothesis: gdpPercap is statistically significant for explaining lifeExp.
Graph 1 plots the actual vs. fitted values, and the fitted values match the actual values well. The curved shapes demonstrate the non-linear growth of lifeExp at high gdpPercap.
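
For reference, here is a minimal sketch of the least-squares computation that my_lm is assumed to perform, checked against base R's lm(); it uses the same formula as above:

X <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y <- my_gapminder$lifeExp
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))  # (X^T X)^{-1} X^T y
beta_hat
coef(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))  # should match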

Tutorial of my_knn_cv

k_cv <- 5                 # number of folds
cv_err <- rep(NA, 10)     # CV misclassification rate for k_nn = 1, ..., 10
train_err <- rep(NA, 10)  # training misclassification rate
for (i in 1:10) {
  # columns 4 and 6 of my_gapminder are lifeExp and gdpPercap
  my_knn <- my_knn_cv(cbind(my_gapminder[, 4], my_gapminder[, 6]), 
                      my_gapminder$continent, i, k_cv)
  cv_err[i] <- my_knn[[2]]
  train_err[i] <- my_knn[[3]]
}
my_err <- data.frame("k_nn" = 1:10,
                     "Training Error" = train_err,
                     "CV Error" = cv_err,
                     check.names = FALSE)
kable(my_err)

Based on the training error, I would choose the model with k_nn $= 1$ because it has no training error. However, looking at the CV misclassification rate, I would choose the model with k_nn $= 7$ because the CV misclassification rate steadily decreases as k_nn goes from $1$ to $7$ and starts to fluctuate after $7$.
In practice, I would choose k_nn $= 7$. From the lecture, we know that k_nn $= 1$ always gives the smallest error on the training data, which is a result of overfitting, and the training error steadily increases as we increase the value of k_nn. The test error, on the other hand, is U-shaped as we increase the value of k_nn. By using cross-validation, we are able to test our model on all of the data, and thus identify the k_nn value that minimizes the test error. Therefore, in this case, the model with k_nn $= 7$ avoids overfitting and gives a good prediction.
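
For context, here is a minimal sketch of the k-fold cross-validation that my_knn_cv is assumed to perform internally, written with class::knn; the helper name knn_cv_sketch and the random fold assignment are illustrative assumptions, not necessarily the package's exact implementation:

library(class)
knn_cv_sketch <- function(train, cl, k_nn, k_cv) {
  # randomly assign each observation to one of k_cv folds
  fold <- sample(rep(1:k_cv, length.out = nrow(train)))
  err <- rep(NA, k_cv)
  for (f in 1:k_cv) {
    # train on all folds except f, predict on fold f
    pred <- knn(train = train[fold != f, ],
                test = train[fold == f, ],
                cl = cl[fold != f], k = k_nn)
    err[f] <- mean(pred != cl[fold == f])  # misclassification rate on fold f
  }
  mean(err)  # CV misclassification rate, averaged over folds
}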

Tutorial of my_rf_cv

k <- c(2, 5, 10)             # numbers of folds to compare
MSE <- rep(NA, 3)            # average CV MSE for each k
cv_err <- matrix(NA, 3, 30)  # 30 CV estimates per value of k
for (i in 1:3) {
  cur <- k[i]
  for (j in 1:30) {
    cv_err[i, j] <- my_rf_cv(cur)
  }
  MSE[i] <- mean(cv_err[i, ])
}
# use a factor so the boxplots appear in the order 2, 5, 10
my_data <- data.frame("k" = factor(rep(k, each = 30), levels = k),
                      "error" = c(cv_err[1, ], cv_err[2, ], cv_err[3, ]))
ggplot(my_data, aes(x = k, y = error, group = k)) + geom_boxplot() + 
  labs(title = "Boxplot for CV Estimated MSE",
       caption = "Graph 2") + theme_bw(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5))

sds <- rep(NA, 3)
for (i in 1:3) {
  sds[i] <- sd(cv_err[i, ])  # standard deviation of the 30 CV estimates
}
outputsd <- data.frame("k" = k,
                       "Mean CV Error" = MSE,
                       "SD of CV Error" = sds,
                       check.names = FALSE)
kable(outputsd)

Both Graph 2 and the table of error statistics show the standard deviation decreasing as the number of folds goes from $2$ to $10$. In general, the variance of the cross-validation estimate is U-shaped in $k$ rather than monotone: it decreases at first because we average over more folds, and it increases for very large $k$ because the training sets overlap almost completely and the fold estimates become highly correlated.
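
For context, here is a minimal sketch of what my_rf_cv is assumed to do for a given k: run k-fold cross-validation of a random forest predicting lifeExp from gdpPercap and return the average MSE across folds. The randomForest call with ntree = 100 and the helper name rf_cv_sketch are illustrative assumptions:

library(randomForest)
rf_cv_sketch <- function(k) {
  # randomly assign each observation to one of k folds
  fold <- sample(rep(1:k, length.out = nrow(my_gapminder)))
  mse <- rep(NA, k)
  for (f in 1:k) {
    model <- randomForest(lifeExp ~ gdpPercap,
                          data = my_gapminder[fold != f, ], ntree = 100)
    pred <- predict(model, my_gapminder[fold == f, ])
    mse[f] <- mean((pred - my_gapminder$lifeExp[fold == f])^2)
  }
  mean(mse)  # CV estimate of the test MSE
}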


