knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(STAT302package)
library(ggplot2)
library(reshape)

Introduction

This package contains all functions developed in the course STAT 302. Each function demonstrates an inference/prediction method and works with the gapminder data.

Installation and Use

To download the STAT302package package, use the code below

# install.packages("devtools")
devtools::install_github("alishaluo/STAT302package")
library(STAT302package)
devtools::install_github("alishaluo/STAT302package", build_vignette = TRUE, 
                         build_opts = c())
library(STAT302package)
# Use this to view the vignette in the corncob HTML help
help(package = "STAT302package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "STAT302package")

my_t_test function

two_side_test <- my_t_test(my_gapminder$lifeExp, alternative = "two.sided",
                           mu = 60)
less_test <- my_t_test(my_gapminder$lifeExp, alternative = "less", mu = 60)
greater_test <- my_t_test(my_gapminder$lifeExp, alternative = "greater", 
                          mu = 60)
test_result <- as.data.frame(cbind(two_side_test$p_value, less_test$p_value,
                    greater_test$p_value))
names(test_result) <- c("two.sided", "less", "greater")
test_result

If the life expectancy is an average of 60 years, we cannot conclude that 9.32% of samples will not equal 60, thus failing to reject the null hypothesis. If the life expectancy is an average of 60 years, only 4.66% of samples will be less than 60. If the life expectancy is an average of 60 years, only 4.66% of samples will be greater than 60.

my_lm function

test <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
test
my_coef <- test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)

my_data <- data.frame("Actual" = my_gapminder$lifeExp, "Fitted" = y_hat, 
                      "Continent" = my_gapminder$continent)
ggplot(my_data, aes(x = Actual, y = Fitted, color = Continent)) + 
  geom_point() + 
  theme_bw(base_size = 2) +
  geom_abline(slope = 1, intercept = 0, col = "black") +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  theme(text= element_text(size = 10))

The null hypothesis would be that there is no effect gdpPercap has on life expectancy. Since the p-value is so low, assuming that gdp per capita does not have an effect on life expectancy, we can say that 8.55e-71 percent of samples will observe gdp per capita having an effect on life expectancy, thus rejecting the null hypothesis.

By looking at the graph, we can see that Europe and Oceania were predicted pretty well, while for a continent like Africa, we predicted the life expectancy to be somewhere from 35 to 75, but the actual life expectancy ranged closer to 50.

my_knn_cv function

set.seed(123)
my_output_err <- rep(NA, 10)
my_train_err <- rep(NA, 10)

for(i in 1:10) {
  my_output <- my_knn_cv(train = my_gapminder[, c(4, 6)], 
                         cl = my_gapminder$continent, k_nn = i, k_cv = 5)
  my_train_err[i] <- mean(my_output$class != my_gapminder$continent)
  my_output_err[i] <- (my_output$CV_error)
}
my_result <- as.data.frame(cbind(my_output_err, my_train_err))
names(my_result) <- c("CV Error", "Training Error")
my_result

Based on what we see in the table, the CV error is decreasing as the number of neighbors increases, and so we would choose 10 for knn, however, looking at the training error, the exact opposite happens. The training error increases as the number of neighbors increases, thus we would choose knn = 1. This happens because of overfitting, where the cv error gets more accurate to the data that you have predicted, but it is too precise to the predicted data, and any newly added data, would cause a lot of error. Whereas, the training error shows that with less number of neighbors, the predicted data may be more incorrect, but may more accurately predict any new data points we add in.

my_rf_cv function

set.seed(321)
cv_mse_2 <- rep(NA, 30)
cv_mse_5 <- rep(NA, 30)
cv_mse_10 <- rep(NA, 30)
for(i in 1:30) {
  cv_mse_2[i] <- my_rf_cv(k = 2)
  cv_mse_5[i] <- my_rf_cv(k = 5)
  cv_mse_10[i] <- my_rf_cv(k = 10)
}
cv_mse <- as.data.frame(cbind(cv_mse_2, cv_mse_5, cv_mse_10))
cv_mse_melt <- melt(cv_mse)
ggplot(cv_mse_melt, aes(factor(variable), value)) +
  geom_boxplot(fill = "lightblue") +
  labs(x = "Number of Folds", y = "CV Error") +
  scale_x_discrete(labels = c("2", "5", "10"))
avg <- rep(NA, 3)
std_dev <- rep(NA, 3)
for(i in 1:length(cv_mse)) {
  avg[i] <- mean(cv_mse[, i])
  std_dev[i] <- sd(cv_mse[, i])
}
as.data.frame(cbind(avg, std_dev))

In both the graph and table, we can see that as the number of folds increases, the cross validation error increase, we also see the standard deviation decrease, and with even more folds, we would see that the standard deviation would start to increase again. This again, would have to do with overfitting the data, when we predict the data with more number of folds, the folds become more highly correlated.



alishaluo/STAT302package documentation built on March 17, 2020, 4:37 a.m.