```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
```
Here, we introduce `Stat302Project03`, a data analysis package with the following functions: `my_t.test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`, which are developed based on `t.test()`, `lm()`, k-nearest neighbors, and random forest, respectively. We use the `gapminder` data, included in the package as `my_gapminder`, as our example dataset.
Install `Stat302Project03` from GitHub using:

```{r, eval = FALSE}
devtools::install_github("Xiaoying-Z/Stat302Project03", build_vignettes = TRUE, build_opts = c())
```
```{r, message = FALSE}
library(Stat302Project03)
library(ggplot2)
library(kableExtra)
```
```{r}
my_t.test(my_gapminder$lifeExp, "two.sided", 60)  # P = 0.09322877
my_t.test(my_gapminder$lifeExp, "less", 60)       # P = 0.04661438
my_t.test(my_gapminder$lifeExp, "greater", 60)    # P = 0.9533856
```
The first line of code evaluates the following hypothesis test:
$H_0: \mu = 60$
$H_a: \mu \neq 60$
With $\alpha = 0.05$ and $P = 0.0932$, the p-value is larger than $\alpha$, so we fail to reject $H_0$: there is not sufficient evidence that the true mean differs from $60$.
The second line of code evaluates the following hypothesis test:
$H_0: \mu = 60$
$H_a: \mu < 60$
With $\alpha = 0.05$ and $P = 0.0466$, we reject $H_0$ in favor of $H_a$: there is evidence that the true mean is less than $60$.
The third line of code evaluates the following hypothesis test:

$H_0: \mu = 60$

$H_a: \mu > 60$

With $\alpha = 0.05$ and $P = 0.953$, the p-value is larger than $\alpha = 0.05$, so we fail to reject $H_0$: there is not sufficient evidence that the true mean is greater than $60$.
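As a quick sanity check on these three p-values, here is a minimal sketch of the underlying computation, assuming `my_t.test` follows the standard one-sample t-test (a t statistic compared against a t distribution with $n - 1$ degrees of freedom via `pt()`):

```{r}
# Minimal sketch of the one-sample t-test behind my_t.test (an assumption:
# my_t.test is developed based on t.test(), so these should match its output)
x <- my_gapminder$lifeExp
n <- length(x)
t_stat <- (mean(x) - 60) / (sd(x) / sqrt(n))  # t statistic for H_0: mu = 60
pt(t_stat, df = n - 1)                        # "less":      P(T <= t_stat)
1 - pt(t_stat, df = n - 1)                    # "greater":   P(T >= t_stat)
2 * pt(-abs(t_stat), df = n - 1)              # "two.sided": 2 * P(T <= -|t_stat|)
```

Note that the reported values are consistent with this: the "less" and "greater" p-values sum to $1$, and the two-sided p-value is twice the smaller one-sided p-value.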
```{r}
# Fit a linear model of life expectancy on GDP per capita and continent
test <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_coef <- test[, 1]
# Compute fitted values by hand from the design matrix and coefficients
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coef)
my_data <- data.frame("Actual" = my_gapminder$lifeExp,
                      "Fitted" = y_hat,
                      "Continent" = my_gapminder$continent)
ggplot(my_data, aes(x = Actual, y = Fitted, color = Continent)) +
  geom_point() +
  theme_bw(base_size = 15) +
  labs(title = "Actual vs. Fitted values", caption = "Graph 1") +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5))
# Pull out the gdpPercap coefficient and its p-value for inline reporting
coeff <- formatC(test$Estimate[2], digits = 3)
pval <- formatC(test$Pr...t..[2], digits = 3)
```
The coefficient of `gdpPercap` in the regression model is `r coeff`. This tells us that there is a positive but weak relationship between `gdpPercap` and `lifeExp`: holding continent fixed, higher GDP per capita is associated with slightly higher life expectancy.

The p-value for the coefficient of `gdpPercap` is `r pval`, which is less than $\alpha = 0.05$, so we reject the null hypothesis that the coefficient of `gdpPercap` is $0$ in favor of the alternative: `gdpPercap` is statistically significant for explaining `lifeExp`.
Graph 1 plots the actual values against the fitted values, and the fitted values match the actual values reasonably well. The curved shapes show that `lifeExp` grows non-linearly in `gdpPercap` at high values of `gdpPercap`.
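Because `my_lm` is developed based on `lm()`, one hedged way to double-check the coefficient and p-value reported above is to compare them against base R's output (assuming the two implementations agree):

```{r}
# Cross-check against base R's lm(), which my_lm is modeled on
base_fit <- summary(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))
base_fit$coefficients["gdpPercap", c("Estimate", "Pr(>|t|)")]
```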
```{r}
# Run k-nearest neighbors with 5-fold cross-validation for k_nn = 1, ..., 10,
# predicting continent from lifeExp (column 4) and gdpPercap (column 6)
k_cv <- 5
cv_err <- rep(NA, 10)
train_err <- rep(NA, 10)
for (i in 1:10) {
  my_knn <- my_knn_cv(cbind(my_gapminder[, 4], my_gapminder[, 6]),
                      my_gapminder$continent, i, k_cv)
  cv_err[i] <- my_knn[[2]]
  train_err[i] <- my_knn[[3]]
}
my_err <- data.frame("knn" = 1:10,
                     "Training Error" = train_err,
                     "CV Error" = cv_err)
kable(my_err)
```
Based on the training error, I would choose the model with `k_nn` $= 1$ because it has no training error. However, based on the CV misclassification rate, I would choose the model with `k_nn` $= 7$ because the CV misclassification rate steadily decreases as `k_nn` goes from $1$ to $7$ and starts to fluctuate after $7$.
In practice, I would choose `k_nn` $= 7$. From lecture, we know that `k_nn` $= 1$ always gives the smallest error on the training data, which is the result of overfitting, and the training error steadily increases as we increase `k_nn`. The test error, on the other hand, is U-shaped as we increase `k_nn`. By using cross-validation, we are able to evaluate our model on all of the data, and thus identify the `k_nn` value that minimizes the estimated test error. Therefore, in this case, the model with `k_nn` $= 7$ avoids overfitting while still giving good predictions.
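To make this comparison easier to see, the sketch below (built on the `my_err` table computed above) plots the training error and the CV misclassification rate against `k_nn`; the decreasing-then-fluctuating shape of the CV curve is what motivates choosing `k_nn` $= 7$:

```{r}
# Plot training and CV error against k_nn
# (data.frame() converts the names "Training Error" / "CV Error" to
#  Training.Error / CV.Error, so those names are used here)
ggplot(my_err, aes(x = knn)) +
  geom_line(aes(y = Training.Error, color = "Training error")) +
  geom_line(aes(y = CV.Error, color = "CV error")) +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "k_nn", y = "Misclassification rate", color = "") +
  theme_bw(base_size = 15)
```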
```{r}
# Run my_rf_cv 30 times for each number of folds k = 2, 5, 10
# and record the CV estimated MSE from each run
k <- c(2, 5, 10)
MSE <- rep(NA, 3)
cv_err <- matrix(NA, 3, 30)
for (i in 1:3) {
  cur <- k[i]
  for (j in 1:30) {
    cv_err[i, j] <- my_rf_cv(cur)
  }
  MSE[i] <- mean(cv_err[i, ])
}
```
```{r}
# Boxplot of the 30 CV estimated MSEs for each value of k
# (a factor keeps the boxes in the order 2, 5, 10)
my_data <- data.frame("k" = factor(rep(c(2, 5, 10), each = 30),
                                   levels = c(2, 5, 10)),
                      "error" = c(cv_err[1, ], cv_err[2, ], cv_err[3, ]))
ggplot(my_data, aes(x = k, y = error, group = k)) +
  geom_boxplot() +
  labs(title = "Boxplot for CV Estimated MSE", caption = "Graph 2") +
  theme_bw(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 0.5))
# Table of the mean and standard deviation of the CV errors for each k
sds <- rep(NA, 3)
for (i in 1:3) {
  sds[i] <- sd(cv_err[i, ])
}
outputsd <- data.frame("k" = k, "mean errors" = MSE, "std.errors" = sds)
kable(outputsd)
```
Both Graph 2 and the table of error statistics show a decreasing standard deviation of the CV errors as the number of folds goes from $2$ to $10$. In general, the variance of a cross-validation estimate is U-shaped in the number of folds rather than monotonically increasing: the standard deviation decreases at first because the CV estimate averages over more folds as $k$ increases, and it increases later because as $k$ gets arbitrarily large the folds' training sets overlap heavily and the fold errors become highly correlated.
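One way to make this reasoning concrete is the standard variance decomposition for a $k$-fold CV estimate (a general identity, not specific to `my_rf_cv`): writing $e_1, \dots, e_k$ for the per-fold errors, the CV estimate is their mean, so

$$\mathrm{Var}\left(\frac{1}{k}\sum_{i=1}^{k} e_i\right) = \frac{1}{k^2}\sum_{i=1}^{k}\mathrm{Var}(e_i) + \frac{1}{k^2}\sum_{i \neq j}\mathrm{Cov}(e_i, e_j).$$

The first sum shrinks relative to $k^2$ as the number of folds grows (the "averaging over more folds" effect above), while the number of covariance terms grows like $k^2$, so once the fold errors become highly correlated the second term dominates and the variance rises again.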