```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The goal of stat302package is to perform t-tests, fit linear regression models, and run the k-nearest-neighbors and random forest algorithms on given data.
You can install the package from GitHub using:
```r
devtools::install_github("https://github.com/RuofengT/stat302package")
library(stat302package)
```
## my_t.test
This function performs a t-test on given data. It takes a numeric vector `x` of data, a string specifying the type of t-test wanted with choices `two.sided`, `less`, or `greater`, and a number `mu` indicating the null hypothesis value. It then conducts the t-test with the given parameters and returns a list containing the numeric test statistic, the degrees of freedom, the type of t-test used, and the numeric p-value.
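The examples below call the packaged function directly; for intuition only, a minimal sketch of the standard one-sample t-test computation in base R (the name `sketch_t_test` is hypothetical, and this is not the package's actual implementation) could look like:

```r
# Minimal sketch of a one-sample t-test (illustrative only; not the
# packaged implementation of my_t.test)
sketch_t_test <- function(x, alternative, mu) {
  df <- length(x) - 1
  # standard one-sample t statistic
  test_stat <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
  p_val <- switch(alternative,
    two.sided = 2 * pt(abs(test_stat), df, lower.tail = FALSE),
    less      = pt(test_stat, df, lower.tail = TRUE),
    greater   = pt(test_stat, df, lower.tail = FALSE)
  )
  list(test_stat = test_stat, df = df,
       alternative = alternative, p_val = p_val)
}
```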
Performing a t-test on the `lifeExp` data from `my_gapminder`, with H_0: mu = 60, H_a: mu ≠ 60:
```r
my_t.test(my_gapminder$lifeExp, "two.sided", 60)
```
The p-value is larger than alpha = 0.05, so we fail to reject the null hypothesis H_0: mu = 60; the data do not provide sufficient evidence that the mean differs from 60.
Performing a t-test on the `lifeExp` data from `my_gapminder`, with H_0: mu = 60, H_a: mu < 60:
```r
my_t.test(my_gapminder$lifeExp, "less", 60)
```
The p-value is smaller than alpha = 0.05, so we reject the null hypothesis H_0: mu = 60 in favor of H_a: mu < 60; the data suggest that the mean of `lifeExp` is below 60.
Performing a t-test on the `lifeExp` data from `my_gapminder`, with H_0: mu = 60, H_a: mu > 60:
```r
my_t.test(my_gapminder$lifeExp, "greater", 60)
```
The p-value is larger than alpha = 0.05, so we fail to reject the null hypothesis H_0: mu = 60; there is insufficient evidence that the mean exceeds 60.
## my_lm
This function performs linear regression on given data. It takes a linear model `formula` and the dataset `data`. It then returns the estimate, the standard error, the t value, and Pr(>|t|) for each coefficient, including the intercept.
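For intuition, such a table can be reproduced with standard ordinary least squares algebra. A minimal sketch (illustrative only; `sketch_lm` is a hypothetical name, not the package's implementation):

```r
# Minimal sketch of an OLS coefficient table (illustrative only)
sketch_lm <- function(formula, data) {
  X <- model.matrix(formula, data)
  y <- model.response(model.frame(formula, data))
  # OLS estimates: beta_hat = (X'X)^{-1} X'y
  xtx_inv <- solve(t(X) %*% X)
  beta_hat <- xtx_inv %*% t(X) %*% y
  # residual variance with n - p degrees of freedom
  df <- nrow(X) - ncol(X)
  sigma2 <- sum((y - X %*% beta_hat)^2) / df
  se <- sqrt(diag(sigma2 * xtx_inv))
  t_val <- as.vector(beta_hat) / se
  p_val <- 2 * pt(abs(t_val), df, lower.tail = FALSE)
  cbind(Estimate = as.vector(beta_hat), "Std. Error" = se,
        "t value" = t_val, "Pr(>|t|)" = p_val)
}
```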
Performing linear regression on `lifeExp` from `my_gapminder` based on `gdpPercap` and `continent`:
```r
# create the formula for linear regression
formula <- lifeExp ~ gdpPercap + continent
# perform linear regression
results_lm <- my_lm(formula, my_gapminder)
# display regression results
results_lm
```
The `gdpPercap` coefficient is `r results_lm[2, 1]`. It means that each additional unit of `gdpPercap` is associated with a change of around `r results_lm[2, 1]` units in `lifeExp`. If we look at the `my_gapminder` data, we see that `gdpPercap` ranges from around 200 to 100,000, while `lifeExp` ranges from around 23 to 82, so it makes sense that a single unit of `gdpPercap` does not influence `lifeExp` much.
The hypothesis test associated with the `gdpPercap` coefficient beta is H_0: beta = 0, H_a: beta ≠ 0. The associated p-value is Pr(>|t|) in the results for `gdpPercap`: `r results_lm[2, 4]`. It is much smaller than alpha = 0.05, so we reject H_0: beta = 0; in other words, we reject the claim that `gdpPercap` has no influence on `lifeExp`.
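As a sanity check, Pr(>|t|) can be recovered from the reported t value and the residual degrees of freedom n - p. Assuming the third column of `results_lm` holds the t value:

```r
# two-sided p-value for gdpPercap from its t value (assumes column 3 of
# results_lm is the t value; df = n - number of coefficients)
df_resid <- nrow(my_gapminder) - nrow(results_lm)
2 * pt(abs(results_lm[2, 3]), df_resid, lower.tail = FALSE)
```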
Plot actual vs. fitted values:
```r
library(ggplot2)
# check levels in my_gapminder$continent
levels(my_gapminder$continent)
# get numerical information from my_gapminder$continent
continent_vals <- as.numeric(my_gapminder$continent)
# replace numerical information with beta_hats
continent_vals[continent_vals == 1] <- 0
continent_vals[continent_vals == 2] <- results_lm[3, 1]
continent_vals[continent_vals == 3] <- results_lm[4, 1]
continent_vals[continent_vals == 4] <- results_lm[5, 1]
continent_vals[continent_vals == 5] <- results_lm[6, 1]
# produce fitted values
y_hat <- my_gapminder$gdpPercap * results_lm[2, 1] + continent_vals + results_lm[1, 1]
# plot actual vs. fitted graph
my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = y_hat)
ggplot(my_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 15) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))
```
Ideally, the points should cluster along the red dashed line y = x, that is, "actual equals fitted". Our points do cluster along this line, but with large vertical spread in places, such as fitted values around 50, 60, and 65. This suggests that our linear model captures some features of the data but does not fit it well.
## my_knn_cv
This function implements the k-nearest-neighbors algorithm on given data and performs k-fold cross-validation. It takes the input data frame, the true class values of the training data, the number of neighbors, and the number of folds. It then returns a list with two objects: `class`, a vector of the predicted class for all observations, and `cv_err`, a numeric with the cross-validation misclassification error.
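Conceptually, the cross-validation loop could be sketched as follows (illustrative only; this assumes `class::knn` as the underlying classifier, and `sketch_knn_cv` is a hypothetical name, not the package's implementation):

```r
# Minimal sketch of k-fold CV around class::knn (illustrative only)
sketch_knn_cv <- function(train, cl, k_nn, k_cv) {
  # randomly assign each observation to one of k_cv folds
  folds <- sample(rep(1:k_cv, length.out = nrow(train)))
  pred <- character(nrow(train))
  for (f in 1:k_cv) {
    # train on all folds except f, then predict fold f
    pred[folds == f] <- as.character(
      class::knn(train[folds != f, ], train[folds == f, ],
                 cl[folds != f], k = k_nn)
    )
  }
  list(class = pred, cv_err = mean(pred != cl))
}
```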
Using the `my_penguins` data, predict the output class `species` using covariates `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g` with k = 5 folds, repeating for i nearest neighbors, i = 1, ..., 10. Then decide which model is better based on the training misclassification rates and CV misclassification rates.
```r
library(class)
library(dplyr)
# filter useful columns in my_penguins dataset
dat <- na.omit(my_penguins[, c("species", "bill_length_mm", "bill_depth_mm",
                               "flipper_length_mm", "body_mass_g")])
# store true classifications of dataset
cl <- dat$species
# leave out true classifications
dat <- dat[, c("bill_length_mm", "bill_depth_mm",
               "flipper_length_mm", "body_mass_g")]
# create an empty matrix to store training and CV misclassification rates
results_knn <- matrix(NA, 10, 2)
# perform k-nearest-neighbor classification with 5 folds,
# iterating through k = 1:10
for (i in 1:10) {
  # record training misclassification rate
  results_knn[i, 1] <- sum(knn(dat, dat, cl, k = i) != cl) / nrow(dat)
  # record CV misclassification rate
  results_knn[i, 2] <- my_knn_cv(dat, cl, i, 5)[[2]]
}
# print out the results
results_knn
```
Based on the training misclassification rate, I would choose k = 1 (1-nearest-neighbor), since its training misclassification rate is the lowest at 0. Based on the CV misclassification rate, I would also choose k = 1 (1-nearest-neighbor), since its CV misclassification rate is the lowest at `r results_knn[1, 2]`.
In practice, I would choose k = 1, based solely on it having the lowest CV misclassification rate. k-fold cross-validation divides the dataset into k folds and uses k - 1 training folds to predict the remaining test fold. This is repeated k times, so every fold is predicted from the other k - 1 folds. In this way the whole dataset serves as both training set and test set: we use all the data to train the model, yet every observation is also tested on. The CV misclassification rate therefore indicates how accurate the model is on unseen data, so we should choose the most accurate model based on it. A small illustration of the fold assignment follows.
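To make the fold mechanics concrete, here is how observations can be randomly assigned to 5 folds (the 20 observations and the seed are arbitrary, chosen only for reproducibility):

```r
# assign 20 hypothetical observations to 5 folds at random; each observation
# lands in exactly one fold, so each is used as test data exactly once
set.seed(302)
folds <- sample(rep(1:5, length.out = 20))
folds
table(folds)  # each fold contains 20 / 5 = 4 observations
```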
## my_rf_cv
This function implements the random forest algorithm together with k-fold cross-validation on given data. It takes the input data frame and the number of folds, and returns a numeric with the cross-validation error.
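As a rough sketch of what such a function computes (illustrative only; this assumes `randomForest::randomForest` with 100 trees and `body_mass_g` as the response, which may differ from the actual implementation):

```r
# Minimal sketch of random forest k-fold CV (illustrative only)
sketch_rf_cv <- function(data, k) {
  # randomly assign each observation to one of k folds
  folds <- sample(rep(1:k, length.out = nrow(data)))
  mse <- numeric(k)
  for (f in 1:k) {
    # fit a random forest on the training folds
    model <- randomForest::randomForest(body_mass_g ~ .,
                                        data = data[folds != f, ],
                                        ntree = 100)
    # predict the held-out fold and record its MSE
    pred <- predict(model, data[folds == f, ])
    mse[f] <- mean((pred - data$body_mass_g[folds == f])^2)
  }
  # average MSE across folds is the CV error
  mean(mse)
}
```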
Using the `my_penguins` data, predict `body_mass_g` using covariates `bill_length_mm`, `bill_depth_mm`, and `flipper_length_mm` with k = 2, 5, 10. For each k, run the function 30 times and store all MSE errors.
```r
library(randomForest)
# filter useful columns in my_penguins dataset
dat2 <- na.omit(my_penguins[, c("body_mass_g", "bill_length_mm",
                                "bill_depth_mm", "flipper_length_mm")])
# create a matrix to store all MSE errors
results_rf <- matrix(NA, 30, 3)
# store k values wanted
count <- c(2, 5, 10)
# iterate through k = 2, 5, 10
for (i in 1:3) {
  for (j in 1:30) {
    # store MSE error for required k
    results_rf[j, i] <- my_rf_cv(dat2, count[i])
  }
}
```
Use `ggplot2` with 3 boxplots to display these data:
```r
# plot when k = 2
ggplot() +
  geom_boxplot(aes(y = results_rf[, 1])) +
  scale_x_discrete() +
  labs(x = "k = 2", y = "MSE Errors", title = "30 MSE Errors when k = 2")
# plot when k = 5
ggplot() +
  geom_boxplot(aes(y = results_rf[, 2])) +
  scale_x_discrete() +
  labs(x = "k = 5", y = "MSE Errors", title = "30 MSE Errors when k = 5")
# plot when k = 10
ggplot() +
  geom_boxplot(aes(y = results_rf[, 3])) +
  scale_x_discrete() +
  labs(x = "k = 10", y = "MSE Errors", title = "30 MSE Errors when k = 10")
```
Notice that as k increases, the MSE errors generally decrease noticeably.
Use a table to display the average CV estimate and the standard deviation of the CV estimates across k:
```r
library(kableExtra)
# create an empty data frame to store values
results_table <- data.frame(matrix(NA, 3, 2))
rownames(results_table) <- c("k = 2", "k = 5", "k = 10")
colnames(results_table) <- c("CV estimate", "Standard Deviation")
# store average CV estimates and standard deviations across k
for (i in 1:3) {
  results_table[i, 1] <- mean(results_rf[, i])
  results_table[i, 2] <- sd(results_rf[, i])
}
# present the table
kable_styling(kable(results_table))
```
As k increases, the average CV estimate decreases slightly and the standard deviation decreases substantially. I think this is because, as the number of folds increases, each training set grows (and each test fold shrinks), resulting in models with lower bias. So the CV estimates decrease, and they become more similar to each other as k increases.