knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Special thanks to Bryan D. Martin for his teaching in STAT 302 and for kindly providing us with the example we use in this vignette.
This package is the final project for STAT 302 at the University of Washington, containing four functions we developed during the quarter:
my_t.test, my_lm, my_knn_cv, my_rf_cv
that allow us to analyze data with multiple methods. The package also includes a dataset:
my_gapminder
taken from the well-known gapminder dataset.
To install Project3Package, use the code below:
devtools::install_github("SylviaDu99/Project3Package")
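If you do not have devtools installed yet, you can install it from CRAN first:
install.packages("devtools")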
To begin, we need to load the following packages and sample dataset:
library(Project3Package)
library(magrittr)
library(ggplot2)
library(dplyr)
library(randomForest)
library(class)
library(stats)
data(my_gapminder)
If you are unfamiliar with my_gapminder, you can view a description of the data using:
?my_gapminder
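You can also take a quick look at the first few rows of the data:
head(my_gapminder)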
In this tutorial, we show how each of these functions works.
The function my_t.test can be used to perform a one-sample t-test, which tells us whether we can reject a null hypothesis about the mean.
For demonstration, we use lifeExp from my_gapminder.
We want to test the null hypothesis that the mean value of lifeExp is 60 $$H_0: \mu = 60$$ with significance level equal to 0.05. $$\alpha = 0.05$$
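For reference, the test statistic here is the standard one-sample t statistic (we assume my_t.test computes exactly this): $$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$ where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\mu_0 = 60$ the hypothesized mean.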
The first example is a two-sided t-test: $$H_a: \mu \neq 60$$
my_t.test(my_gapminder$lifeExp, "two.sided", 60)
From the output list, the p-value is greater than the significance level 0.05, so we cannot reject the null hypothesis.
The second example is a lower-tailed t-test: $$H_a: \mu < 60$$
my_t.test(my_gapminder$lifeExp, "less", 60)
From the output list, the p-value is smaller than 0.05, so we can reject the null hypothesis.
The third example is an upper-tailed t-test: $$H_a: \mu > 60$$
my_t.test(my_gapminder$lifeExp, "greater", 60)
From the output list, the p-value is greater than 0.05, so we cannot reject the null hypothesis.
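As a sanity check, we can compare the two-sided result against base R; this assumes my_t.test mirrors the alternative and mu arguments of stats::t.test:
# base-R equivalent of the two-sided example above
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)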
The function my_lm can be used to fit a linear model to the input data.
For demonstration, we use data from my_gapminder. Let lifeExp be the response variable, and gdpPercap and continent be the explanatory variables.
my_model <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_model
From the output table, we can see that the coefficient (the Estimate column) of gdpPercap is about 0.000445, which means that, holding continent fixed, lifeExp increases by about 0.000445 for every one-unit increase in gdpPercap; for example, an increase of 1,000 in gdpPercap corresponds to an increase of roughly 0.445 in predicted lifeExp.
To test whether this coefficient is significant, set $H_0$ and $H_a$ to:
$$H_0: \beta_{\text{gdpPercap}} = 0$$
$$H_a: \beta_{\text{gdpPercap}} \neq 0$$
and set the significance level to:
$$\alpha = 0.05$$
From the output table of my_lm, we can find the p-value (the Pr(>|t|) column).
Since the p-value is much smaller than 0.05, we can reject the null hypothesis.
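To see where this p-value comes from, here is a minimal sketch that recomputes it by hand. It assumes my_model is a coefficient table whose rows are named by term and whose first two columns are the Estimate and the Std. Error, as used in the plotting code below:
# t statistic for gdpPercap: Estimate / Std. Error
est <- my_model["gdpPercap", 1]
se <- my_model["gdpPercap", 2]
t_val <- est / se
# degrees of freedom: number of observations minus number of coefficients
df <- nrow(my_gapminder) - nrow(my_model)
# two-sided p-value
2 * pt(abs(t_val), df = df, lower.tail = FALSE)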
To visualize how well the model fits, we can plot the actual values against the fitted values:
formula <- lifeExp ~ gdpPercap + continent
f <- model.frame(formula, my_gapminder)
# extract the design matrix X
x <- model.matrix(formula, my_gapminder)
# extract the response variable y
y <- model.response(f) %>% as.matrix()
# fitted values: y_hat = X %*% beta
y_hat <- x %*% as.matrix(my_model[, 1])
my_df <- data.frame("actual" = my_gapminder$lifeExp,
                    "fitted" = y_hat,
                    "continent" = my_gapminder$continent)
ggplot(my_df, aes(x = fitted, y = actual, color = continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Actual vs Fitted Values",
       x = "Fitted Values",
       y = "Actual Values",
       color = "Continent") +
  theme_bw(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_text(hjust = 0.5))
From the visualization, we can see that this model fits Europe and Oceania better than the other continents. To obtain a better-fitting model, we should also try other algorithms.
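To put a number on that visual impression, we can compute the mean absolute error of the fitted values within each continent, using the my_df built above (the mae column name is ours):
# mean absolute error by continent
my_df %>%
  group_by(continent) %>%
  summarise(mae = mean(abs(actual - fitted)))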
The function my_knn_cv predicts a class value with the k-nearest-neighbors algorithm, using k-fold cross-validation to estimate the misclassification error.
For demonstration, let continent be the class value we predict, using gdpPercap and lifeExp as covariates.
train <- my_gapminder %>% select(gdpPercap, lifeExp)
cl <- my_gapminder$continent
result <- matrix(NA, nrow = 10, ncol = 2)
for (i in 1:10) {
  output <- my_knn_cv(train, cl, i, 5)
  # CV misclassification rate
  result[i, 1] <- output$cv_err
  # training misclassification rate
  result[i, 2] <- output$te
}
result <- data.frame("number of neighbors" = c(1:10),
                     "CV misclassification rate" = result[, 1],
                     "training misclassification rate" = result[, 2])
result
k-fold cross-validation is an algorithm that randomly splits the data into k folds: k - 1 folds are used to train the model and the remaining fold is used to test it, and the process repeats so that each fold serves once as test data. It is useful because the resulting CV error estimates how well the model predicts data it has never seen, which lets us choose a model that generalizes well instead of one that merely fits the training data.
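As a rough illustration of this procedure, here is a minimal sketch of a 5-fold CV error computation using class::knn with one neighbor; this is not necessarily how my_knn_cv is implemented internally:
# assign each observation to one of 5 folds at random
k <- 5
folds <- sample(rep(1:k, length = nrow(train)))
errs <- numeric(k)
for (j in 1:k) {
  test_idx <- which(folds == j)
  # train on the other k - 1 folds, predict the held-out fold
  pred <- class::knn(train = train[-test_idx, ],
                     test = train[test_idx, ],
                     cl = cl[-test_idx],
                     k = 1)
  errs[j] <- mean(pred != cl[test_idx])
}
# average misclassification rate across the folds
mean(errs)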
For number of neighbors = 1, the training misclassification rate is always 0, while the CV misclassification rate is about 0.55, (almost) the highest among numbers of neighbors 1 through 10. For number of neighbors = 10, the training misclassification rate is about 0.4 and the CV misclassification rate is about 0.48; this case has the lowest CV misclassification rate and the highest training misclassification rate. Based on the training misclassification rates alone, we would choose the model with number of neighbors = 1; based on the CV misclassification rates, we would choose the model with number of neighbors = 10. In practice, the model with number of neighbors = 5 is a better choice, since both its training and CV misclassification rates are relatively low.
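We can also pick the best number of neighbors programmatically; indexing result by column position avoids relying on the exact (mangled) column names:
# row of the result table with the smallest CV misclassification rate
result[which.min(result[, 2]), ]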
The function my_rf_cv predicts lifeExp in the my_gapminder dataset from gdpPercap, using the random forest algorithm with k-fold cross-validation.
For demonstration, let k be 2, 5, and 10, and iterate each case 30 times.
# column 1: number of folds k; column 2: CV estimated MSE from my_rf_cv
cv_error <- matrix(NA, nrow = 90, ncol = 2)
for (i in 1:30) {
  cv_error[i, 1] <- 2
  cv_error[i, 2] <- my_rf_cv(2)
  cv_error[i + 30, 1] <- 5
  cv_error[i + 30, 2] <- my_rf_cv(5)
  cv_error[i + 60, 1] <- 10
  cv_error[i + 60, 2] <- my_rf_cv(10)
}
cv_error
my_df <- data.frame("k" = cv_error[, 1], "mse" = cv_error[, 2])
ggplot(my_df, aes(x = k, y = mse, group = k, fill = factor(k))) +
  geom_boxplot() +
  labs(title = "MSE of k folds",
       x = "Number of Folds (k)",
       y = "MSE",
       fill = "Number of Folds") +
  theme_bw(base_size = 15) +
  scale_x_continuous(breaks = c(2, 5, 10)) +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_text(hjust = 0.5, size = 10))
my_t <- my_df %>%
  group_by(k) %>%
  summarise(mean = mean(mse), sd = sd(mse)) %>%
  select(-k) %>%
  as.matrix() %>%
  as.table()
rownames(my_t) <- c("k = 2", "k = 5", "k = 10")
my_t
According to the boxplot and the table, as k increases, the range and the standard deviation of the MSE decrease, while the mean and the median of the MSE increase. A wider range of values goes together with a larger standard deviation, which explains why the 2-fold case has the largest sd and the 10-fold case the smallest. The more folds we create, the more of the data each training iteration uses, which helps generate a better model and more accurate predictions.
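To back up the claim about the spread, we can compute the range and standard deviation of the MSE for each k directly (the range column name is ours):
# spread of the CV MSE within each choice of k
my_df %>%
  group_by(k) %>%
  summarise(range = diff(range(mse)), sd = sd(mse))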