knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This package contains four functions:\ my_t_test : run a one sample t-test\ my_lm : fit data into a linear model\ my_knn_cv : take n-nearest neighbors as model to run k-fold cross validation.\ my_rf_cv : perform random forest cross validation\


To install and use this package, run the following lines of code in console:

# install package
devtools::install_github("RolinaC/STAT302PACKAGE")
library(STAT302PACKAGE)

# load other packages
library(ggplot2)
library(dplyr)
library(kableExtra)

Here are what we need for data:

data("my_penguins")
data("my_gapminder")

my_t.test Tutorial

Here we make a reference to the 'lifeExp' data from 'my_gapminder' to show how the my_t.test functions. Note that the 'lifeExp' indicates the life expectancy at birth.

Two-sided Test

First, we demonstrate a test of the hypothesis that is two-sided and the null hypothesis is 60 indicating the mean life expectancy is 60 years; meanwhile, the alternative hypothesis holds that the mean life expectancy is not 60 years.

two_side <- my_t_test(my_gapminder$lifeExp, "two.sided", 60)

The p-value, r two_side$p_val, is greater than a (a = 0.05), which cannot reject the null hypothesis, meaning there is not much evidence to support the alternative hypothesis.

One-side Test

Second, we demonstrate two single-tailed tests of the hypothesis, one lower tail and one upper tail. Both has the null hypothesis saying 60 indicating the mean life expectancy is 60 years. The lower-tail alternative hypothesis holds that the life expectancy is less than 60 years and the upper-tail alternative hypothesis holds that the life expectancy is greater than 60 years.

low <- my_t_test(my_gapminder$lifeExp, "less", 60)
up <- my_t_test(my_gapminder$lifeExp, "greater", 60)

The lower-trail p-value, r low$p_val, is less than a (a = 0.05), which is enough to reject the null hypothesis, meaning there is enough evidence to support the alternative hypothesis, the life expectancy is less than 60 years.

The upper -trail p-value, r up$p_val, is greater than a (a = 0.05), which cannot reject the null hypothesis, meaning there is not much evidence to support the alternative hypothesis, the life expectancy is greater than 60 years.

my_lm Tutorial

Here, we demonstrate a regression using 'lifeExp' as the response variable and 'gdpPercap' and continent as explanatory variables, all from 'my_gapminder'.

# run the my_lm to get statistics
eqn <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
eqn

Let's define what the estimated coefficient means under this condition: * Estimated coefficient : the expected change in 'lifeExp' per unit change in GDP per capita, holding all other variables constant.\ By above model, we know that the estimated coefficient on 'gdpPercap' is r eqn[2, 1]. \

Now, let's plot the model and explore more information on this.

#get data to be fitted
X <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# calculate the actual
y_hat <- X %*% eqn[, 1]
y <- my_gapminder$lifeExp

df <- data.frame("Actual" = y, "Fitted" = y_hat, col = my_gapminder$continent)
ggplot2::ggplot(df, ggplot2::aes(x = Fitted, y = Actual)) +
  ggplot2::geom_point() +
  ggplot2::geom_abline(slope = 1, intercept = 0, col = "blue") + 
  ggplot2::labs(title = "Actual vs. Fitted", x = "Actual", y = "Fitted") +
  ggplot2::theme_bw() +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5))
  ggplot2::ggsave("plot1.jpg")

Accordng to the graph, we can conclude that the model follows a linear relationship since most points are plotted along the best fit line with some outliers.

my_knn_cv Tutorial

Here, we use the 'my_knn_cv' function to predict a output class of species by using covariates: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Now let's use 5-fold cross validation (k_cv = 5) from k_nn = 1 to k_nn = 10 to analyze.

Let's define what the K-fold cross-validation means under this condition:

# clear data
train <- na.omit(my_penguins)
# pull cl from train
cl <- dplyr::pull(train, var = "species")

train <- dplyr::select(train, 3:6)
# store CV miss-classification rate
cv_err <- base::rep(NA, 10)
# store training miss-classification rate
train_err <- base::rep(NA, 10)

for (i in 1:10) {
  # compute and store the CV err
  result <- STAT302PACKAGE::my_knn_cv(train, cl, i, 5)
  cv_err[i] <- result[["cv_error"]]
  # compute and store the train err
  train_err[i] <- mean(result[["class"]] != cl)
}

knn_val <- c(1:10)

df <- data.frame(knn_val, train_err, cv_err)
colnames(df) <- c("k_nn", "training error", "CV error")
# style the table
kableExtra::kable_styling(knitr::kable(df))

Based on the above function, it is rational to choose k_nn to be 1. The reason is that, at k_nn = 1, we have the smallest training mse rate and cv mse rate.

my_rf_cv Tutorial

Here we will manipulate the my_rf_cv to run a random forest cross-validation to predict the 'body_mess_g' by referring to the covariates: bill_length_mm, bill_depth_mm, and flipper_length_mm. We will take k to be 2, 5, and 10 and run the function 30 times to get the statistics.

# create an empty matrix
result <- matrix(NA, nrow = 30, ncol = 3)
for (k in c(2, 5, 10)) {
  for (i in 1:30) {
    for (j in 1:3){
      result[i, j] <- my_rf_cv(k)
    }
  }
}

df <- as.data.frame(result)
colnames(df) <- c("k = 2", "k = 5", "k = 10")


#plot the simulations
plot_df <- as.data.frame(matrix(NA, nrow = 90, ncol = 2))
n <- c(df$`k = 2`, df$`k = 5`, df$`k = 10`)
plot_df[, 1] <- cbind(n)
plot_df[, 2] <- cbind(rep(c("2", "5", "10"), each = 30))
colnames(plot_df) <- c("mse", "k")

plot <- ggplot2::ggplot(data = plot_df, 
              aes(x = k, y = mse)) +
         ggplot2::geom_boxplot(fill = "lightblue") +
         ggplot2::scale_x_discrete(limits = c("2", "5", "10")) +
         ggplot2::labs(title = "Distribution of CV estimated MSE by number of                         folds", x = "Number of folds", 
                       y = "CV estimated MSE") +
         ggplot2::theme_bw() +
         ggplot2::theme(plot.title = element_text(hjust = 0.5, size = 14))
         ggplot2::ggsave("plot2.jpg")
plot

By the above boxplots, notice that when k = 2, we have the largest mse and medians of mse, while the statistics of k = 10 are quite smaller. Thus, we can conclude that as the k increases, the cv mse decreases.



RolinaC/STAT302PACKAGE documentation built on Dec. 18, 2021, 10:59 a.m.