```{r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r}
# Attach libraries used for this tutorial
library(STAT302package)
library(class)
library(palmerpenguins)
library(randomForest)
library(magrittr)
library(dplyr)
library(gapminder)
library(tibble)
library(ggplot2)
```
This is the final project for STAT302. A package called `STAT302package` was created, and it includes four functions: `my_t.test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`.
Package collaborators: Cherry Pan & Yi-Hsuan Wu.
To install the package:
```{r, eval = FALSE}
devtools::install_github("Cherry-ty-Pan/STAT302package",
                         build_vignettes = TRUE, build_opts = c())
library(STAT302package)
# Use this to view the vignette in the Demo HTML help
help(package = "STAT302package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "STAT302package")
```
## my_t.test
Use the `lifeExp` data from `my_gapminder` for the following tests.
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &\neq 60.
\end{align}
```{r}
# perform a t-test with alternative = "two.sided"
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")
```
In this case, the p-value is 0.093, which is greater than 0.05, so the result is not statistically significant: we do not have enough evidence to reject the null hypothesis.
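As a quick check, the two-sided p-value can also be computed by hand from the one-sample t-statistic; this is a minimal sketch assuming `my_t.test` uses the standard formula with n - 1 degrees of freedom.

```{r}
# manual computation of the two-sided p-value, assuming the standard
# one-sample t-statistic with n - 1 degrees of freedom
x <- my_gapminder$lifeExp
t_stat <- (mean(x) - 60) / (sd(x) / sqrt(length(x)))
2 * pt(abs(t_stat), df = length(x) - 1, lower.tail = FALSE)
```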
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &< 60.
\end{align}
```{r}
# perform a t-test with alternative = "less"
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "less")
```
In this case, the p-value is 0.047, which is less than 0.05, so the result is statistically significant: we have enough evidence to reject the null hypothesis in favor of the alternative that the true mean is less than 60.
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &> 60.
\end{align}
```{r}
# perform a t-test with alternative = "greater"
my_t.test(my_gapminder$lifeExp, mu = 60, alternative = "greater")
```
In this case, the p-value is 0.953, which is greater than 0.05, so the result is not statistically significant: we do not have enough evidence to reject the null hypothesis.
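All three tests can also be reproduced with base R's `stats::t.test()`, which should return the same p-values as `my_t.test`:

```{r}
# cross-check against base R's t.test(); p-values should match my_t.test()
t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")$p.value
t.test(my_gapminder$lifeExp, mu = 60, alternative = "less")$p.value
t.test(my_gapminder$lifeExp, mu = 60, alternative = "greater")$p.value
```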
## my_lm
```{r}
# a regression using `lifeExp` as the response variable and `gdpPercap` and
# `continent` as explanatory variables
reg <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
reg
coef <- reg$Estimate
x_mat <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# store actual values and fitted values in a data frame
df <- data.frame("actual" = my_gapminder$lifeExp,
                 "fitted" = x_mat %*% as.matrix(coef),
                 "continent" = my_gapminder$continent)
# plot the Actual vs. Fitted values graph
ggplot(df, aes(x = actual, y = fitted, color = continent)) +
  geom_point() +
  theme_bw(base_size = 15) +
  labs(x = "Actual values", y = "Fitted values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_smooth(method = "lm", se = FALSE, aes(group = 1))
```
After comparing the coefficients of the different continents, we can tell that `continent` has more influence on `lifeExp` than `gdpPercap` does. Here, the `gdpPercap` coefficient is 4.452704e-04, meaning that the expected difference in `lifeExp` between two observations differing by one unit in `gdpPercap` is 4.452704e-04, with all other covariates identical.
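Because the coefficient is small on a per-dollar scale, it can help to rescale it; this is simple arithmetic on the estimate above:

```{r}
# expected difference in lifeExp for a $1,000 difference in gdpPercap,
# holding all other covariates identical
1000 * 4.452704e-04  # about 0.445 years
```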
We will do a two-sided hypothesis test associated with the `gdpPercap` coefficient. The null hypothesis is that `gdpPercap` has no significant influence on `lifeExp`:
\begin{align}
H_0: \beta &= 0,\\
H_a: \beta &\neq 0.
\end{align}
The p-value for this hypothesis test is 8.553e-73, which is much smaller than 0.05. Thus the result is statistically significant, and we have sufficient evidence to reject the null hypothesis: `gdpPercap` has a significant influence on `lifeExp`.
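As a sanity check, the coefficient table can be compared against base R's `stats::lm()`, which should agree with the `my_lm` output:

```{r}
# cross-check against base R's lm(); the estimates, standard errors,
# t values, and p-values should match my_lm()
summary(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))$coefficients
```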
## my_knn_cv
This section uses the `my_penguins` data.
```{r}
# Predict output class `species` using covariates `bill_length_mm`,
# `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`.
# omit all the NAs in my_penguins
my_penguins <- na.omit(my_penguins)
train_err <- rep(NA, 10)
cv_err <- rep(NA, 10)
# store training errors and CV errors for k_nn = 1 through 10,
# using 5-fold cross-validation
for (i in 1:10) {
  test <- my_knn_cv(my_penguins[, 3:6], my_penguins$species, i, 5)
  train_err[i] <- sum(my_penguins$species != test[[1]]) / nrow(my_penguins)
  cv_err[i] <- test[[2]]
}
# record the training error and CV error
record <- data.frame("k_nn" = c(1:10),
                     "Training Error" = train_err,
                     "CV Error" = cv_err)
record
```
Based on the table, I would use the k_nn = 1 model because it has the lowest error rate for both the training misclassification rate and the CV misclassification rate. Cross-validation is a model validation technique that assesses how the results will generalize to an independent data set: the data are split into folds, each fold is held out in turn while the model is trained on the remaining folds, and the held-out predictions are scored. In practice, we want to choose the model that generates as little CV error as possible, since the training error (especially for k_nn = 1, where each point is its own nearest neighbor) can be an overly optimistic measure.
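To make the procedure concrete, below is a minimal sketch of 5-fold cross-validation for a 1-nearest-neighbor classifier using `class::knn()` directly. The random fold assignment is an assumption about how `my_knn_cv` works internally, not the package's confirmed implementation.

```{r}
# minimal sketch of 5-fold CV for a 1-nearest-neighbor classifier;
# the random fold assignment is an assumption about my_knn_cv()'s internals
set.seed(302)
n <- nrow(my_penguins)
fold <- sample(rep(1:5, length.out = n))  # randomly assign each row to a fold
errs <- rep(NA, 5)
for (j in 1:5) {
  pred <- class::knn(train = my_penguins[fold != j, 3:6],
                     test = my_penguins[fold == j, 3:6],
                     cl = my_penguins$species[fold != j],
                     k = 1)
  errs[j] <- mean(pred != my_penguins$species[fold == j])
}
mean(errs)  # CV misclassification rate
```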
## my_rf_cv
```{r}
# Predict `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and
# `flipper_length_mm` with the `my_rf_cv` function, iterating through
# k in c(2, 5, 10) for 30 iterations each.
# Save the iteration results as `cv_2`, `cv_5`, and `cv_10` respectively.
cv_2 <- replicate(30, my_rf_cv(2))
cv_5 <- replicate(30, my_rf_cv(5))
cv_10 <- replicate(30, my_rf_cv(10))
# Store results as a data frame
cv_df <- data.frame("k" = c(rep(2, 30), rep(5, 30), rep(10, 30)),
                    "cv" = c(cv_2, cv_5, cv_10))
# Plot the results as a boxplot
cv_plot <- ggplot(cv_df, aes(factor(k), cv)) +
  geom_boxplot() +
  labs(x = "k", y = "CV estimated MSE")
cv_plot
# Display the statistics of the results as a table
cv_table <- data.frame("k" = c(2, 5, 10),
                       "Mean" = c(mean(cv_2), mean(cv_5), mean(cv_10)),
                       "Standard Deviation" = c(sd(cv_2), sd(cv_5), sd(cv_10)))
cv_table
```
From the boxplot, we observe that the CV estimated MSE has the greatest mean when k = 2, at above 120,000. However, by inspecting the table, we can see that the standard deviation of the CV estimated MSE is greatest when k = 10. This may be because with only 2 folds each model is trained on the least data, which inflates the MSE, while with 10 folds each validation set is smallest, which inflates the variance of the estimates in this data set.
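For reference, here is a minimal sketch of what a single call to `my_rf_cv(k)` presumably does: k-fold cross-validation of a random forest predicting `body_mass_g`, returning the average MSE across folds. The number of trees (100) and the fold assignment are assumptions, not the package's confirmed implementation.

```{r}
# minimal sketch of k-fold CV for a random forest predicting body_mass_g;
# ntree = 100 and the fold assignment are assumptions about my_rf_cv()
rf_cv_sketch <- function(k) {
  dat <- na.omit(my_penguins)[, 3:6]
  fold <- sample(rep(1:k, length.out = nrow(dat)))
  mse <- rep(NA, k)
  for (j in 1:k) {
    model <- randomForest::randomForest(
      body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
      data = dat[fold != j, ], ntree = 100)
    pred <- predict(model, dat[fold == j, ])
    mse[j] <- mean((pred - dat$body_mass_g[fold == j])^2)
  }
  mean(mse)
}
rf_cv_sketch(5)  # one CV estimate of the MSE with 5 folds
```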