knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This package contains four functions:\ my_t_test : run a one sample t-test\ my_lm : fit data into a linear model\ my_knn_cv : take n-nearest neighbors as model to run k-fold cross validation.\ my_rf_cv : perform random forest cross validation\
To install and use this package, run the following lines of code in console:
# install package devtools::install_github("RolinaC/STAT302PACKAGE")
library(STAT302PACKAGE) # load other packages library(ggplot2) library(dplyr) library(kableExtra)
Here are what we need for data:
data("my_penguins") data("my_gapminder")
Here we make a reference to the 'lifeExp' data from 'my_gapminder' to show how the my_t.test functions. Note that the 'lifeExp' indicates the life expectancy at birth.
First, we demonstrate a test of the hypothesis that is two-sided and the null hypothesis is 60 indicating the mean life expectancy is 60 years; meanwhile, the alternative hypothesis holds that the mean life expectancy is not 60 years.
two_side <- my_t_test(my_gapminder$lifeExp, "two.sided", 60)
The p-value, r two_side$p_val
, is greater than a (a = 0.05), which cannot reject the null hypothesis, meaning there is not much evidence to support the alternative hypothesis.
Second, we demonstrate two single-tailed tests of the hypothesis, one lower tail and one upper tail. Both has the null hypothesis saying 60 indicating the mean life expectancy is 60 years. The lower-tail alternative hypothesis holds that the life expectancy is less than 60 years and the upper-tail alternative hypothesis holds that the life expectancy is greater than 60 years.
low <- my_t_test(my_gapminder$lifeExp, "less", 60) up <- my_t_test(my_gapminder$lifeExp, "greater", 60)
The lower-trail p-value, r low$p_val
, is less than a (a = 0.05), which is enough to reject the null hypothesis, meaning there is enough evidence to support the alternative hypothesis, the life expectancy is less than 60 years.
The upper -trail p-value, r up$p_val
, is greater than a (a = 0.05), which cannot reject the null hypothesis, meaning there is not much evidence to support the alternative hypothesis, the life expectancy is greater than 60 years.
Here, we demonstrate a regression using 'lifeExp' as the response variable and 'gdpPercap' and continent as explanatory variables, all from 'my_gapminder'.
# run the my_lm to get statistics eqn <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder) eqn
Let's define what the estimated coefficient means under this condition:
* Estimated coefficient : the expected change in 'lifeExp' per unit change in GDP per capita, holding all other variables constant.\
By above model, we know that the estimated coefficient on 'gdpPercap' is r eqn[2, 1]
. \
Now, let's plot the model and explore more information on this.
#get data to be fitted X <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) # calculate the actual y_hat <- X %*% eqn[, 1] y <- my_gapminder$lifeExp df <- data.frame("Actual" = y, "Fitted" = y_hat, col = my_gapminder$continent) ggplot2::ggplot(df, ggplot2::aes(x = Fitted, y = Actual)) + ggplot2::geom_point() + ggplot2::geom_abline(slope = 1, intercept = 0, col = "blue") + ggplot2::labs(title = "Actual vs. Fitted", x = "Actual", y = "Fitted") + ggplot2::theme_bw() + ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5)) ggplot2::ggsave("plot1.jpg")
Accordng to the graph, we can conclude that the model follows a linear relationship since most points are plotted along the best fit line with some outliers.
Here, we use the 'my_knn_cv' function to predict a output class of species by using covariates: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Now let's use 5-fold cross validation (k_cv = 5) from k_nn = 1 to k_nn = 10 to analyze.
Let's define what the K-fold cross-validation means under this condition:
# clear data train <- na.omit(my_penguins) # pull cl from train cl <- dplyr::pull(train, var = "species") train <- dplyr::select(train, 3:6) # store CV miss-classification rate cv_err <- base::rep(NA, 10) # store training miss-classification rate train_err <- base::rep(NA, 10) for (i in 1:10) { # compute and store the CV err result <- STAT302PACKAGE::my_knn_cv(train, cl, i, 5) cv_err[i] <- result[["cv_error"]] # compute and store the train err train_err[i] <- mean(result[["class"]] != cl) } knn_val <- c(1:10) df <- data.frame(knn_val, train_err, cv_err) colnames(df) <- c("k_nn", "training error", "CV error") # style the table kableExtra::kable_styling(knitr::kable(df))
Based on the above function, it is rational to choose k_nn to be 1. The reason is that, at k_nn = 1, we have the smallest training mse rate and cv mse rate.
Here we will manipulate the my_rf_cv to run a random forest cross-validation to predict the 'body_mess_g' by referring to the covariates: bill_length_mm, bill_depth_mm, and flipper_length_mm. We will take k to be 2, 5, and 10 and run the function 30 times to get the statistics.
# create an empty matrix result <- matrix(NA, nrow = 30, ncol = 3) for (k in c(2, 5, 10)) { for (i in 1:30) { for (j in 1:3){ result[i, j] <- my_rf_cv(k) } } } df <- as.data.frame(result) colnames(df) <- c("k = 2", "k = 5", "k = 10") #plot the simulations plot_df <- as.data.frame(matrix(NA, nrow = 90, ncol = 2)) n <- c(df$`k = 2`, df$`k = 5`, df$`k = 10`) plot_df[, 1] <- cbind(n) plot_df[, 2] <- cbind(rep(c("2", "5", "10"), each = 30)) colnames(plot_df) <- c("mse", "k") plot <- ggplot2::ggplot(data = plot_df, aes(x = k, y = mse)) + ggplot2::geom_boxplot(fill = "lightblue") + ggplot2::scale_x_discrete(limits = c("2", "5", "10")) + ggplot2::labs(title = "Distribution of CV estimated MSE by number of folds", x = "Number of folds", y = "CV estimated MSE") + ggplot2::theme_bw() + ggplot2::theme(plot.title = element_text(hjust = 0.5, size = 14)) ggplot2::ggsave("plot2.jpg") plot
By the above boxplots, notice that when k = 2, we have the largest mse and medians of mse, while the statistics of k = 10 are quite smaller. Thus, we can conclude that as the k increases, the cv mse decreases.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.