knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
I would like to express my special thanks to Bryan D. Martin, who taught an amazing quarter of Stat302.
This package is Project 3 for Stat302 at UW. It contains four functions:

my_t_test, my_lm, my_knn_cv, my_rf_cv

that analyze data and evaluate statistical models. It also includes a dataset, my_gapminder, taken from the well-known gapminder package. You can install the package from GitHub as follows:
devtools::install_github("yinuotxie/package302")
To begin, we need to load the following packages and the example dataset.
library(package302)
library(magrittr)
library(ggplot2)
library(dplyr)
library(kableExtra)
data("my_gapminder")
If you are unfamiliar with my_gapminder, you can check the description of the data using ?my_gapminder in the console.
Or, you can directly view the dataset:
my_gapminder
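For a quicker structural overview, dplyr's glimpse function (dplyr is loaded above) works as well; this is just an optional alternative to printing the full dataset:

# compact summary of each column's type and first few values
glimpse(my_gapminder)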
In the tutorial sections below, I will demonstrate how to use each function to analyze data.
The function my_t_test performs a one-sample t-test, which is helpful when you want to test a null hypothesis about a mean. We will run some t-tests using the lifeExp variable from the my_gapminder dataset.
Here, we want to test the null hypothesis that the mean value of lifeExp is 60, so we set mu = 60.
mu <- 60
That is: $$H_0: \mu = 60$$ We also set the significance level to 0.05: $$\alpha = 0.05$$ Now, we are ready to run the t-tests.
First example -- two-sided t-test: $$H_a: \mu \neq 60$$
my_t_test(data = my_gapminder$lifeExp, mu, alt = "two.sided")
From the output, we can see the t-value, the degrees of freedom, the p-value, and the alternative hypothesis. The p-value is greater than the significance level of 0.05, so we fail to reject the null hypothesis.
Second example -- one-sided t-test (less): $$H_a: \mu < 60$$
my_t_test(data = my_gapminder$lifeExp, mu, alt = "less")
The p-value for "less" is less than the significance level of 0.05. Thus, we reject the null hypothesis in favor of the alternative.
Third example -- one-sided t-test (greater): $$H_a: \mu > 60$$
my_t_test(data = my_gapminder$lifeExp, mu, alt = "greater")
We fail to reject the null hypothesis because the p-value is greater than the significance level of 0.05.
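To make all three decisions programmatically, we could compare each returned p-value to the significance level. Here is a minimal sketch; it assumes my_t_test returns a list containing a p_val element, which may be named differently in your version of the package:

alpha <- 0.05
for (alt in c("two.sided", "less", "greater")) {
  # run the test with the current alternative hypothesis
  result <- my_t_test(data = my_gapminder$lifeExp, mu, alt = alt)
  # assumed element name: p_val
  decision <- ifelse(result$p_val < alpha, "reject H0", "fail to reject H0")
  cat(alt, ":", decision, "\n")
}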
my_lm is a very useful function when it comes to linear regression. Here, we will perform a linear regression using lifeExp as our response variable and gdpPercap and continent as explanatory variables.
Note: The Estimate column displays the coefficients. The last column displays Pr(>|t|), which is the p-value from a two-sided t-test of whether each coefficient equals zero.
my_model <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# display my_model
my_model
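As a sanity check, the Pr(>|t|) column can be reproduced by hand from the Estimate and Std.Error columns (the sketch below assumes my_model stores them under those names, as used later in this vignette):

# residual degrees of freedom: n minus the number of coefficients
df <- nrow(my_gapminder) - length(my_model$Estimate)
# t-statistic for each coefficient
t_val <- my_model$Estimate / my_model$Std.Error
# two-sided p-value
p_val <- 2 * pt(abs(t_val), df = df, lower.tail = FALSE)
p_val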
The coefficient for gdpPercap, `r my_model$Estimate[2]`, is positive. It indicates that lifeExp increases by `r my_model$Estimate[2]` units as gdpPercap increases by one unit, holding continent fixed.
We can also perform a hypothesis test on the gdpPercap coefficient.
We first need to set up $H_0$ and $H_a$: $$H_0: coef = 0$$ $$H_a: coef \neq 0$$ Then, we test at the significance level of 0.05: $$\alpha = 0.05$$ The p-value for gdpPercap, obtained from the table above, is much smaller than the significance level, so we reject the null hypothesis in favor of the alternative.
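Using the hand-computed p-values from the sketch above (position 2 is assumed to correspond to gdpPercap, matching my_model$Estimate[2] used earlier):

# compare the gdpPercap p-value to the significance level
p_val[2] < 0.05  # TRUE means we reject the null hypothesis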
Next, we want to know how well our model performs in prediction. We first need to extract the model components: x is the matrix of explanatory variables and y is the response variable.
object <- lifeExp ~ gdpPercap + continent
# extract the model frame
model <- model.frame(object, my_gapminder)
# extract x
x <- model.matrix(object, my_gapminder)
# extract y
y <- model.response(model) %>% as.matrix()
We can visualize the actual values against the fitted values with ggplot.
# fitted values: y_hat = X %*% beta_hat
my_lifeExp <- x %*% my_model$Estimate
my_df <- data.frame("actual" = my_gapminder$lifeExp,
                    "fitted" = my_lifeExp,
                    "color" = my_gapminder$continent)
my_df %>%
  ggplot(aes(x = fitted, y = actual, color = color)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Actual vs. Fitted Values",
       x = "Fitted Values",
       y = "Actual Values",
       color = "Continent") +
  theme_classic(base_size = 18) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "floralwhite"),
        legend.title = element_text(hjust = 0.5),
        legend.position = c(0.8, .5),
        legend.justification = c("right", "top"),
        legend.box.just = "right",
        legend.box.background = element_rect(colour = "cyan"))
By coloring points by continent, we can see that the fitted values track the actual values well for Europe and Oceania, but not for the other continents. Therefore, our linear model does not predict the entire dataset well. To make the model more accurate, we may need to consider other regression methods, for example, polynomial regression (see the sketch below).
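As one possible next step, here is a sketch using base R's lm (not my_lm, which may not support polynomial terms) that adds a quadratic term in gdpPercap:

# fit a degree-2 polynomial in gdpPercap, keeping continent as a covariate
poly_model <- lm(lifeExp ~ poly(gdpPercap, 2) + continent, data = my_gapminder)
summary(poly_model)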
my_knn_cv predicts values using the k-nearest-neighbors method. It also uses cross-validation to train and evaluate models multiple times in order to improve accuracy. Here, we will predict the output class continent using the covariates gdpPercap and lifeExp from the my_gapminder data.
Note: I highly recommend typing ?my_knn_cv in the console to familiarize yourself with each argument before going through the following tutorial.
We first need to pull out our train and cl data.
# pull out the training data
train <- my_gapminder %>% select(gdpPercap, lifeExp)
# pull out the true class labels
cl <- my_gapminder$continent
In order to better understand the my_knn_cv function, we need to understand how cross-validation works. First, cross-validation splits the data into k different folds. Then it uses k-1 folds to train the model and the remaining fold to test it, repeating the process so that each fold serves as the test set once. This makes cross-validation very useful for evaluating models. Once we have a model with a reasonable estimate of its out-of-sample test error, we can use the full data to train our final predictive model. A minimal sketch of the fold-splitting logic follows below.
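This sketch is written independently of my_knn_cv, just to illustrate how the folds are assigned and reused:

k <- 5
n <- nrow(my_gapminder)
# randomly assign each observation to one of k folds
fold <- sample(rep(1:k, length.out = n))
for (j in 1:k) {
  train_idx <- which(fold != j)  # k - 1 folds used for training
  test_idx <- which(fold == j)   # the remaining fold used for testing
  # ... fit the model on the training rows, compute the error on the test rows
}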
Now, we will run the prediction ten times with k_cv = 5 and k_nn values from 1 to 10.
# create a matrix to store the output
result <- matrix(NA, nrow = 10, ncol = 2)
# loop through the values of k_nn
for (i in 1:10) {
  # obtain the output
  output <- my_knn_cv(train = train, cl = cl, k_nn = i, k_cv = 5)
  # cv misclassification rate
  result[i, 1] <- output$cv_error
  # training misclassification rate
  result[i, 2] <- (sum(output$class != cl) / length(cl)) %>% round(4)
}
# convert to a data.frame so it can be displayed in a table
result <- data.frame("Number of neighbors" = c(1:10),
                     "cv misclassification rate" = result[, 1],
                     "training misclassification rate" = result[, 2])
kable_styling(kable(result))
The table tells us a lot. Based on it, we can decide which model to use: we want both the cv and the training misclassification rates to be as small as possible.
Based on the training misclassification rate, we would pick the model with the smallest rate, which is k_nn = 1. However, if we choose based on the cv misclassification rate, we would choose k_nn = 10. In practice, I would probably choose a model with 5 or 6 neighbors, because in those cases both the cv and the training misclassification rates are in an acceptable range. Even so, the model is not very accurate, since it misclassifies roughly half the observations no matter how many folds or neighbors we choose.
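Having settled on a value such as k_nn = 5, the final predictive model would be trained on the full data. In this sketch we simply rerun my_knn_cv with the chosen number of neighbors and inspect its error:

# final model with the chosen number of neighbors
final <- my_knn_cv(train = train, cl = cl, k_nn = 5, k_cv = 5)
# cv misclassification rate for the chosen model
final$cv_error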
my_rf_cv is a powerful tool for prediction that uses random forests and cross-validation. Here, we use it to predict lifeExp in each country from the covariate gdpPercap.
We will demonstrate by setting k equal to 2, 5, and 10. For each k, we will run the function 30 times.
Note: this will take several minutes to run because of the large number of trees (ntree).
# create a matrix to store the output for each k
cv_error <- matrix(NA, nrow = 90, ncol = 2)
# the first column is the value of k, each repeated 30 times
cv_error[, 1] <- rep(c(2, 5, 10), each = 30)
# rows begin at 1
row <- 1
# iterate through k in c(2, 5, 10)
for (k in c(2, 5, 10)) {
  # for each value of k, run the function 30 times
  for (i in 1:30) {
    # record the MSE
    cv_error[row, 2] <- my_rf_cv(k)
    # go to the next row
    row <- row + 1
  }
}
To better visualize the estimated MSE, we will display it in boxplots.
# create a data.frame
my_df <- data.frame("k" = cv_error[, 1], "mse" = cv_error[, 2])
my_df %>%
  ggplot(aes(x = factor(k), y = mse, fill = factor(k))) +
  geom_boxplot() +
  labs(title = "MSE for K folds",
       x = "Number of Folds",
       y = "MSE",
       fill = "Number of Folds") +
  theme_classic(base_size = 18) +
  theme(plot.title = element_text(hjust = 0.5),
        panel.background = element_rect(fill = "floralwhite"),
        legend.title = element_text(hjust = 0.5, size = 15),
        legend.margin = margin(6, 6, 6, 6))
We can also calculate the mean and standard deviation of MSE for each k. We will display the outcomes in a table.
mse_sum <- my_df %>%
  # group the data by the value of k
  group_by(k) %>%
  # calculate the mean and sd of the MSE for each k
  summarise(mean = mean(mse), sd = sd(mse))
kable_styling(kable(mse_sum))
From the boxplots, we can see that the range of the MSE decreases and the median increases as k increases. The table shows the same trends -- the mean increases and the standard deviation decreases as k increases. This happens because as the number of folds increases, we train the model on more data and evaluate it more times. More training data can increase the accuracy of the prediction, and more evaluations can decrease the variation of the MSE.