```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
This package was created as a class project for STAT 302 at the University of Washington. It contains four functions: `my_t_test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`.

This package is a collaboration between Celeste Zeng and Simona Liao.
Install the package using:
```{r, eval = FALSE}
devtools::install_github("SimonaLiao/Stat302Package", build_vignettes = TRUE, build_opts = c())
library(Stat302Package)
# Use this to view the vignette in the Demo HTML help
help(package = "Stat302Package", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "Stat302Package")
```
To begin, we load our example data set `my_gapminder`.
```{r, message = FALSE}
library(Stat302Package)
library(ggplot2)
library(magrittr)
library(tidyverse)
data("my_gapminder")
```
## my_t_test
All the tests demonstrated below use the `lifeExp` data from `my_gapminder`.
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &\neq 60.
\end{align}
```{r}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")
```
The p-value of this test is greater than 0.05, so the result is not statistically significant: we do not have sufficient evidence to reject the null hypothesis $H_0$.
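If you want to sanity-check this output, and assuming `my_t_test` is modeled on `stats::t.test` (an assumption, not something stated in this vignette), the base R call below should give a comparable p-value:

```{r, eval = FALSE}
# Hedged cross-check: compare my_t_test's output against base R's t.test
# (assumes my_t_test mirrors stats::t.test)
t.test(my_gapminder$lifeExp, mu = 60, alternative = "two.sided")
```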
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &< 60.
\end{align}
```{r}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "less")
```
The p-value of this test is less than 0.05, so the result is statistically significant: we have sufficient evidence to reject the null hypothesis $H_0$ in favor of the alternative $\mu < 60$.
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &> 60.
\end{align}
```{r}
my_t_test(my_gapminder$lifeExp, mu = 60, alternative = "greater")
```
The p-value of this test is greater than 0.05, so the result is not statistically significant: we do not have sufficient evidence to reject the null hypothesis $H_0$.
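For reference, all three p-values arise from the same t statistic via the standard one-sample t-test formulas. This is textbook arithmetic, not necessarily `my_t_test`'s internal code:

```{r, eval = FALSE}
# Standard one-sample t-test arithmetic (a sketch, not my_t_test's internals)
x <- my_gapminder$lifeExp
t_stat <- (mean(x) - 60) / (sd(x) / sqrt(length(x)))
deg_free <- length(x) - 1
p_two_sided <- 2 * pt(abs(t_stat), deg_free, lower.tail = FALSE)
p_less <- pt(t_stat, deg_free)
p_greater <- pt(t_stat, deg_free, lower.tail = FALSE)
c(p_two_sided, p_less, p_greater)
```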
## my_lm
```{r}
# fit a linear model of lifeExp on gdpPercap and continent
test <- my_lm(my_fml = lifeExp ~ gdpPercap + continent, my_data = my_gapminder)
# extract the gdpPercap coefficient estimate and its p-value
estimate <- test[2, 1]
p_val <- test[2, 4]
test
```
From the statistics above, the expected difference in the response `lifeExp` between two observations differing by one unit in `gdpPercap`, with all other covariates identical, is `r estimate`. Compared to the coefficients of the different continents, differences in `gdpPercap` have less influence on `lifeExp` than differences in `continent`.
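Assuming `my_lm` is intended to mirror the coefficient table of `summary.lm` (an assumption about the package, not something this vignette states), you can compare its output against base R:

```{r, eval = FALSE}
# Hedged cross-check: compare my_lm's table against base R's lm()
summary(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))$coefficients
```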
Now, we will perform a two-sided hypothesis test on the `gdpPercap` coefficient. The null hypothesis is that `gdpPercap` does not significantly predict `lifeExp`.
\begin{align}
H_0: \beta &= 0,\\
H_a: \beta &\neq 0.
\end{align}
The p-value for the hypothesis test is `r p_val`, which is smaller than 0.05. The result is statistically significant, so we have sufficient evidence to reject the null hypothesis: `gdpPercap` is a significant predictor of `lifeExp`.
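For context, this p-value comes from the usual t statistic for a regression coefficient. Assuming the second column of `test` holds the standard error, as in `summary.lm`'s table (an assumption about `my_lm`'s layout), the computation is:

```{r, eval = FALSE}
# How the p-value follows from the coefficient table
# (assumes column 2 of `test` is the standard error, as in summary.lm)
se <- test[2, 2]
t_val <- estimate / se
deg_free <- nrow(my_gapminder) - nrow(test)
2 * pt(abs(t_val), deg_free, lower.tail = FALSE)
```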
Now, we are going to plot the Actual vs. Fitted values using `ggplot2`.
```{r}
# compute fitted values by hand from the model matrix and coefficients
my_coef <- test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)
my_df <- data.frame("actual" = my_gapminder$lifeExp,
                    "fitted" = y_hat,
                    "continent" = my_gapminder$continent)
ggplot(my_df, aes(x = actual, y = fitted, color = continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "black") +
  theme_bw(base_size = 15) +
  labs(x = "Actual values", y = "Fitted values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))
```
The Actual vs. Fitted plot shows that this model predicts life expectancy poorly for countries in Africa, but it does a good job of predicting life expectancy in European countries.
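One way to back this visual impression with a number (a hypothetical check, not part of the original analysis) is to compute the mean squared prediction error by continent from the `my_df` data frame built above:

```{r, eval = FALSE}
# Hypothetical check: mean squared prediction error by continent
my_df %>%
  dplyr::group_by(continent) %>%
  dplyr::summarize(mse = mean((actual - fitted)^2))
```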
## my_knn_cv
Using the `my_gapminder` dataset, we predict the output class `continent` from the covariates `gdpPercap` and `lifeExp` with 5-fold cross-validation (`k_cv = 5`).
```{r}
# For each value of k_nn, record the training misclassification rate
# and the CV misclassification rate
knn_vector <- c(1:10)
tu_knn <- matrix(NA, nrow = 10, ncol = 3)
for (i in 1:10) {
  # run my_knn_cv once per k_nn so cv_err and class come from the same fit
  result <- my_knn_cv(train = my_gapminder[, 3:4],
                      cl = my_gapminder$continent,
                      k_nn = knn_vector[i],
                      k_cv = 5)
  tu_knn[i, 1] <- knn_vector[i]
  # get the CV misclassification rate
  tu_knn[i, 3] <- result$cv_err
  # get the training misclassification rate
  prediction <- result$class
  tu_knn[i, 2] <- sum(prediction != my_gapminder$continent) / length(my_gapminder$continent)
}
tu_knn <- data.frame("k_nn" = tu_knn[, 1],
                     "training_misclassification_rate" = tu_knn[, 2],
                     "CV_misclassification_rate" = tu_knn[, 3])
tu_knn
```
Based on the training misclassification rates, you would choose `k_nn = 1` (with `k_nn = 1`, each training point is its own nearest neighbor, so the training error is 0, a classic sign of overfitting). Based on the CV misclassification rates, you would choose `k_nn = 10`. In practice, I would use cross-validation to identify the optimal k and choose it as my `k_nn` (here, 10), because CV splits the data into k folds, using as much data as possible to train the model while still keeping out-of-sample test data.
CV is a great tool to choose values for tuning parameters across a number of algorithms, not just k-nearest neighbors.
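To make the procedure concrete, here is a minimal sketch of what k-fold CV for k-nearest neighbors can look like, assuming an implementation built on `class::knn`; this illustrates the idea and is not necessarily `my_knn_cv`'s exact code:

```{r, eval = FALSE}
# Minimal k-fold CV sketch for knn (assumes class::knn; not my_knn_cv's exact code)
sketch_knn_cv <- function(train, cl, k_nn, k_cv) {
  # randomly assign each observation to one of k_cv folds
  fold <- sample(rep(1:k_cv, length.out = nrow(train)))
  errs <- numeric(k_cv)
  for (f in 1:k_cv) {
    # train on all folds except f, then predict on fold f
    pred <- class::knn(train = train[fold != f, ],
                       test = train[fold == f, ],
                       cl = cl[fold != f],
                       k = k_nn)
    errs[f] <- mean(pred != cl[fold == f])
  }
  # return full-data predictions plus the average out-of-sample error
  list(class = class::knn(train, train, cl, k = k_nn),
       cv_err = mean(errs))
}
```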
## my_rf_cv
We predict `lifeExp` using the covariate `gdpPercap`.
```{r}
# run my_rf_cv 30 times for each number of folds k in (2, 5, 10)
k_vector <- c(2, 5, 10)
tu_rf <- matrix(NA, nrow = 90, ncol = 2)
for (i in 1:length(k_vector)) {
  for (j in 1:30) {
    tu_rf[(i - 1) * 30 + j, 1] <- k_vector[i]
    tu_rf[(i - 1) * 30 + j, 2] <- my_rf_cv(k_vector[i])
  }
}
```
We use boxplots to display the results: each boxplot corresponds to one k-value (2, 5, or 10) and summarizes 30 simulations.
```{r}
tu <- data.frame("K_value" = tu_rf[, 1], "MSE" = tu_rf[, 2])
# one boxplot per k-value, each summarizing 30 simulations
tu_graph <- ggplot(data = tu, aes(x = K_value, y = MSE, group = K_value)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw(base_size = 16) +
  labs(title = "CV estimated MSE for different k-values",
       x = "k-values",
       y = "Estimated MSE") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
  scale_x_continuous(breaks = c(2, 5, 10))
tu_graph
rf_vector <- c(mean(tu_rf[1:30, 2]), mean(tu_rf[31:60, 2]), mean(tu_rf[61:90, 2]),
               sd(tu_rf[1:30, 2]), sd(tu_rf[31:60, 2]), sd(tu_rf[61:90, 2]))
rf_matrix <- matrix(rf_vector, nrow = 3, ncol = 2)
# Use a table to display the average CV estimate and the standard deviation
# of the CV estimates across k
rf_table <- data.frame("k_values" = c(2, 5, 10),
                       "mean" = rf_matrix[, 1],
                       "sd" = rf_matrix[, 2])
rf_table
```
From these results we observe that as k gets bigger, the means of the CV estimates also get bigger, but the standard deviations get smaller. The ranges of the CV estimates also get smaller as k gets bigger.
According to the bias-variance tradeoff, a larger k means less bias toward overestimating the true expected error (the training folds are closer to the full dataset) but typically higher variance and longer running time (we approach the limiting case of leave-one-out CV). Although `rf_table` shows decreasing standard deviations, that is because the variance of cross-validation estimates is closer to U-shaped than linearly increasing in k: it decreases at first because we average over more trials (folds), and increases later because, as k approaches n, the folds become highly correlated. If we added a k-value of 100, the standard deviation/variance would be larger.
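If you wanted to test this U-shape claim empirically (a hypothetical extension, not run in this vignette, and assuming `my_rf_cv` accepts an arbitrary fold count), you could repeat the simulation with a much larger k and compare spreads:

```{r, eval = FALSE}
# Hypothetical extension: spread of CV estimates at a much larger k
mse_k100 <- replicate(30, my_rf_cv(100))
sd(mse_k100)
```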