knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

library(magrittr)
library(dplyr)
library(project3package)
library(kableExtra)
library(ggplot2)
library(tidyr)
devtools::install_github("anasmuhd/FirstPackage", build_vignettes = TRUE, build_opts = c())
library(FirstPackage)

Tutorials

This vignette walks through the four functions provided by this package: my_t.test, my_lm, my_knn_cv, and my_rf_cv.

my_t.test

$$H_0: \mu = 60$$ $$H_a: \mu \neq 60$$

$$\alpha = 0.05$$

Test 1, alternative = less

# my_gapminder <- gapminder::gapminder
# my_t.test(my_gapminder$lifeExp, "less", 60)

Explanation: Running the t-test with alternative = "less" yields a p-value smaller than 0.05. We therefore reject the null hypothesis that mu = 60 in favor of the alternative that mu < 60.

Test 2, alternative = greater

# my_t.test(my_gapminder$lifeExp, "greater", 60)

Explanation: Running the t-test with alternative = "greater" yields a p-value greater than 0.05. We therefore fail to reject the null hypothesis that mu = 60; there is no evidence that mu > 60.

Test 3, alternative = two.sided

# my_t.test(my_gapminder$lifeExp, "two.sided", 60)

Explanation: Running the t-test with alternative = "two.sided" yields a p-value greater than 0.05. We therefore fail to reject the null hypothesis that mu = 60.
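As a quick sanity check (not part of the original vignette), the same three hypotheses can be run through base R's stats::t.test, whose p-values my_t.test should match:

# # Sanity check: base R equivalents of the three calls above
# t.test(my_gapminder$lifeExp, alternative = "less", mu = 60)
# t.test(my_gapminder$lifeExp, alternative = "greater", mu = 60)
# t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)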

my_lm

The demonstration is as follows:

# my_model <- my_lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
# my_model

From the output, the estimated gdpPercap coefficient is 0.00137712, which means that a one-unit (one-dollar) increase in gdpPercap is associated with a 0.00137712-year increase in lifeExp for the baseline continent, holding the interaction terms fixed. Equivalently, each additional $1,000 of GDP per capita corresponds to roughly 1.4 additional years of expected life expectancy.
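To make that magnitude concrete, here is an illustrative back-of-the-envelope calculation (not output from the original vignette):

# # Illustrative: expected change in lifeExp for a $1,000 increase in
# # gdpPercap on the baseline continent, holding the interactions fixed
# 1000 * 0.00137712  # ~1.38 additional years of life expectancy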

# lm_fit <- lm(lifeExp ~ gdpPercap * continent,
#              my_gapminder[c("lifeExp", "gdpPercap", "continent")])
# mod_fits <- fitted(lm_fit)
# my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = mod_fits)
# ggplot(my_df, aes(x = fitted, y = actual)) +
#   geom_point() +
#   geom_abline(slope = 1, intercept = 0, col = "blue", lty = 1) +
#   theme_bw() +
#   labs(x = "Predicted values", y = "Actual values", title = "Actual vs. Predicted") +
#   theme(plot.title = element_text(hjust = 0.5))

Explanation: Each point in the plot compares a fitted (predicted) value from the linear model with the corresponding actual life expectancy, and the blue line is y = x, where perfect predictions would fall. Points that cluster tightly around this line indicate a good fit, while systematic departures from it show regions where the model over- or under-predicts; the degree of scatter around the line reflects how much variation in lifeExp the model leaves unexplained.
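For reference, the fitted values are simply the design matrix multiplied by the estimated coefficients. A minimal sketch, assuming lm_fit from the chunk above:

# # Fitted values by hand: y-hat = X %*% beta-hat
# X <- model.matrix(lm_fit)   # intercept, gdpPercap, continent dummies, interactions
# beta_hat <- coef(lm_fit)    # estimated coefficients
# head(drop(X %*% beta_hat))  # matches head(fitted(lm_fit))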

my_knn_cv

The demonstration is as follows:

# my_penguins <- my_penguins %>% drop_na()
# data_Frame <- my_penguins[c("bill_length_mm", "bill_depth_mm",
#                             "flipper_length_mm", "body_mass_g")]
# class <- my_penguins$species
# cv_errs <- vector(length = 10)
# train_errs <- vector(length = 10)
# for (i in 1:10) {
#   cv_errs[i] <- my_knn_cv(data_Frame, class, i, 5)$cv_err
#   # Training error: proportion of points misclassified when predicting
#   # the training data with itself using standard knn
#   pred <- class::knn(data_Frame, data_Frame, class, k = i)
#   train_errs[i] <- mean(pred != class)
# }
# # Print table of errors
# err_df <- cbind(1:10, data.frame(cv_errs), data.frame(train_errs))
# colnames(err_df) <- c("k_nn", "cv_errors", "train_errors")
# err_df

Explanation: Based on the training misclassification rates alone, we would choose k_nn = 1, which has a training error of 0.000. Based on the CV misclassification rates, we would choose any value of k_nn that yields the error of 0.8018018. In practice, we would choose the model selected by cross-validation, since its error is consistent and the gap between the CV errors and the training errors shrinks as k_nn increases; a smaller gap indicates better agreement between training and test performance. Cross-validation is a sample-splitting procedure that divides the data into small segments (folds) so that predictive performance can be estimated without a separate test set. The steps, sketched in code below, are:

1. Split the data into k parts (folds).
2. Use all but one fold as the training data and fit the model.
3. Use the remaining fold as the test data and make predictions.
4. Switch which fold is the test data and repeat steps 2 and 3 until every fold has been the test data (k times).
5. Compute the average error across the folds (the misclassification rate here, or the squared error for regression).
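The following is a minimal, illustrative sketch of that procedure; the helper cv_misclass is hypothetical and is not the implementation inside my_knn_cv:

# # Hypothetical k-fold CV skeleton for a knn classifier
# cv_misclass <- function(data, cl, k_nn, k_folds) {
#   # Randomly assign each row to one of k_folds folds
#   folds <- sample(rep(1:k_folds, length.out = nrow(data)))
#   errs <- numeric(k_folds)
#   for (f in 1:k_folds) {
#     train <- data[folds != f, ]
#     test  <- data[folds == f, ]
#     pred  <- class::knn(train, test, cl[folds != f], k = k_nn)
#     # Misclassification rate on the held-out fold
#     errs[f] <- mean(pred != cl[folds == f])
#   }
#   mean(errs)
# }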

my_rf_cv

# data(my_penguins)
# my_penguins <- my_penguins %>% drop_na()
# 
# mse_mat <- matrix(nrow = 90, ncol = 2)
# ks <- c(2, 5, 10)
# for (i in 1:3) {
#   k <- ks[i]
#   for (j in 1:30) {
#     mse_mat[(i-1)*30+j, 1] <- k
#     mse_mat[(i-1)*30+j, 2] <- my_rf_cv(k)
#   }
# }
# mse_df <- data.frame(k = mse_mat[, 1], mse = mse_mat[, 2])
# ggplot(data = mse_df, aes(group = k, x = k, y = mse)) + 
#   geom_boxplot(fill = "yellow") +
#   labs(title = "MSE based on Folds", x = "Number of Folds", y = "MSE") +
#   theme_bw(base_size = 15) + 
#   theme(plot.title =
#           element_text(hjust = 0.5),
#         text = element_text(size = 10))
# 
# 
# mse_k2 <- (mse_df %>% filter(k == 2) %>% select(mse))$mse
# mse_k5 <- (mse_df %>% filter(k == 5) %>% select(mse))$mse
# mse_k10 <- (mse_df %>% filter(k == 10) %>% select(mse))$mse
# mse_stat <- rbind(c(2, mean(mse_k2), sd(mse_k2)), 
#                        c(5, mean(mse_k5), sd(mse_k5)), 
#                        c(10, mean(mse_k10), sd(mse_k10)))
# mse_stats_df <- data.frame(k = mse_stat[,1],
#                            mse_mean = mse_stat[,2], 
#                            mse_sd = mse_stat[,3])
# 
# kable_styling(kable(mse_stats_df))

Explanation: From the boxplot, we can see that 2 folds records the highest mean MSE of the three settings. Its box is also the widest, which matches the table: 2 folds records the highest standard deviation at 6296.5, while 10 folds records the lowest at 2433.0. This is expected: with only 2 folds, each model is trained on just half of the data, so the fits vary more from split to split and are more sensitive to extreme observations, inflating both the mean and the variability of the MSE estimates. With 10 folds, each model is trained on 90% of the data, producing lower and more stable MSE estimates, which is reflected in the smaller standard deviation.
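To compare the spreads on a common scale, one could also look at the standard deviation relative to the mean (an illustrative addition, assuming mse_stats_df from the chunk above):

# # Illustrative: relative spread (sd / mean) for each number of folds
# mse_stats_df$mse_rel_sd <- mse_stats_df$mse_sd / mse_stats_df$mse_mean
# mse_stats_df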


