knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(happypackage) library(ggplot2) library(class) library(tidyverse)
This package is generated for the courese STAT 302 as project 3 in University of Washington. The package includes four functions: my_t_test, my_lm, my_knn_cv, and my_rf_cv.
my_t_test performs a one sample t-test in R.
my_lm fits a linear model in R.
my_knn_cv performs a k-Nearest Neighbors Cross-Validation in R.
my_rf_cv performs a Random Forest Cross-Validation in R.
\n
\begin{align} H_0: \mu &= 60,\ H_a: \mu &\neq 60. \end{align}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided") p_val <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")$p_val
The p-value here is r p_val, which is larger than 0.05. So we cannot conclude that a significant difference exists, and we fail to reject the null hypothesis.
\begin{align} H_0: \mu &= 60,\ H_a: \mu &< 60. \end{align}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less") p_val_2 <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")$p_val
The p-value here is r p_val_2, which is smaller than 0.05. So we can conclude that a significant difference exists, and we reject the null hypothesis.
\begin{align} H_0: \mu &= 60,\ H_a: \mu &> 60. \end{align}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater") p_val_3 <- my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")$p_val
The p-value here is r p_val_3, which is larger than 0.05. So we cannot conclude that a significant difference exists, and we fail to reject the null hypothesis.
# load gapminder data data("my_gapminder") # use my_lm to model lm_test <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder) summary(lm_test)
According to the summary, the p value is less than 0.05, so we reject the null hypothesis. Therefore, the slope of the linear regression line is other than 0.
# extract the intercepts my_coef <- lm_test[, 1] my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) # calculate the fitted value y_hat <- my_matrix %*% as.matrix(my_coef) # create data frame for the dot plot lm_data <- data.frame("Actual" = my_gapminder$lifeExp, "Fitted" = y_hat, "Continent" = my_gapminder$continent) # draw the dot plot lm_plot <- ggplot(lm_data, aes(x = Actual, y = Fitted, color = Continent)) + geom_point() + theme_bw(base_size = 12) + labs(title = "Fitted life Expectancy VS Actual life Expectancy") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 12)) + geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) lm_plot
According to the graph, we can see that the model fits all continents well, except Africa.
While Europe has the overall highest life expectancy, following by Americans and Asia.
# create an empty matrix to record the training misclassification rate and the CV misclassification rate result <- matrix(NA, nrow = 2, ncol = 10) # iterate from k_nn=1,…,10 and store the training misclassification rate and the CV misclassification rate for(i in 1:10) { test_i <- my_knn_cv(my_gapminder, my_gapminder$continent, i, 5) result[1, i] <- test_i$cv_error accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100} tab_i <- table(test_i$class, my_gapminder$continent) training_set_error_k_nn <- round((100 - accuracy(tab_i)) / 100, digit = 2) result[2, i] <- training_set_error_k_nn } # provide row and column name for the matrix rownames(result) <- c("cv_err", "train_err") colnames(result) <- paste("knn_", seq(1:10), sep = "") # print the matrix result
I would choose knn = 1 model because it has the smallest training misclassification rates.
I would choose knn = 10 model because it ahs the smallest CV misclassification rates.
I would choose knn = 10 in practice because its difference between cv error and training error is smallest.
The process of cross-validation is that, it splits the data into k folds, and repeats choosing one fold as test data while others become training data. After each fold appeared as test data, it calculates the average of mean squared error.
It is useful because it makes the model more specific but not overfitted.
# create a matrix with 90 rows and 1 column cv_err_matrix <- matrix(NA, 90, 1) # apply function my_rf_cv with k = 2, 5, 10 and store results into the matrix for(i in 1:30) { cvv_err_2 <- my_rf_cv(2) cv_err_matrix[i, 1] <- cvv_err_2 cvv_err_5 <- my_rf_cv(5) cv_err_matrix[30 + i, 1] <- cvv_err_5 cvv_err_10 <- my_rf_cv(10) cv_err_matrix[60 + i, 1] <- cvv_err_10 } # create another matrix for identify the numnber of k for each data so that we can apply it for graphing the box plot later cv_err_type_matrix <- matrix(NA, 90, 1) for(i in 1:30) { cv_err_type_matrix[i, 1] <- "k = 2" cv_err_type_matrix[30 + i, ] <- "k = 5" cv_err_type_matrix[60 + i, ] <- "k = 10" } # calculate the mean and standard deviation for all the CV estimate in each 30 simulations each for each k values k_2_mean <- mean(cv_err_matrix[1:30, ]) k_2_sd <- sd(cv_err_matrix[1:30, ]) k_5_mean <- mean(cv_err_matrix[31:60, ]) k_5_sd <- sd(cv_err_matrix[31:60, ]) k_10_mean <- mean(cv_err_matrix[61:90, ]) k_10_sd <- sd(cv_err_matrix[61:90, ]) k_table_matrix <- matrix(c(k_2_mean, k_2_sd, k_5_mean, k_5_sd, k_10_mean, k_10_sd), ncol = 2, byrow = TRUE) # provide row and column name for the matrix rownames(k_table_matrix) <- c("k = 2", "k = 5", "k = 10") colnames(k_table_matrix) <- c("mean", "sd") # convert the matrix to a table and display the table k_table <- as.table(k_table_matrix) k_table # create data frame for graphing the boxplots cv_err_data_frame <- data.frame(cv_err = cv_err_matrix[, 1], k_value = cv_err_type_matrix[, 1]) # graph boxplot for the CV errors for all k values used boxplot <- ggplot(data = cv_err_data_frame, aes(x = reorder(k_value, cv_err, FUN = median), y = cv_err)) + geom_boxplot(fill = "lightblue") + theme_bw(base_size = 12) + labs(title = "Cross validation error VS k value") + theme(plot.title = element_text(hjust = 0.5, face = "bold", size = 12)) + theme(plot.title = element_text(hjust = 0.5)) + ylim(c(70, 85)) + xlab("k fold value") + ylab("Cross Validation Error") boxplot
From the box plot, we can observe that as the k fold value gets bigger, the variance in cross validation error gets smaller, the standard deviations get smaller, and the median is increasing.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.