```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The package `project3package` contains four functions. `my_t.test` can be used to make statistical inferences, and `my_lm` can be used to make predictions as well as inferences. `my_knn_cv` and `my_rf_cv` use cross-validation to assess the predictive performance of models whose predictions are made using k-nearest neighbors and random forests, respectively.
To install `project3package` from GitHub, run the following code in your console:

```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("anikalindley/project3package", build_vignettes = TRUE, build_opts = c())
```
```{r}
library(project3package)
```
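With the package loaded, the four functions are called as follows. This is only a preview, with arguments passed positionally exactly as in the worked examples below (`my_train` and `my_cl` are constructed in the `my_knn_cv` section):

```{r, eval = FALSE}
# one-sample t-test of H_0: mu = 60 on a numeric vector
my_t.test(x = my_gapminder$lifeExp, alternative = "two.sided", mu = 60)
# linear regression with a formula and a data frame
my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
# k-nearest-neighbor CV: training covariates, true classes, neighbors, folds
my_knn_cv(my_train, my_cl, 1, 5)
# random forest CV: number of folds
my_rf_cv(5)
```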
To demonstrate the use of `my_t.test`, I will examine the mean life expectancy from the `my_gapminder` dataset. For all three tests, the null hypothesis is $H_0: \mu = 60$, and I will use `my_t.test` to assess the evidence that the mean life expectancy is not 60 years, is less than 60 years, or is greater than 60 years.
```{r}
# load gapminder data
data("my_gapminder")
# store life expectancy in the life_expectancy variable
life_expectancy <- my_gapminder$lifeExp
```
```{r}
# store my_t.test output for the two-sided test
test_two.sided <- my_t.test(x = life_expectancy, alternative = "two.sided", mu = 60)
test_two.sided
```
The p-value for the test with $H_a: \mu \neq 60$ is `r test_two.sided[[4]]`, which is greater than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is a `r test_two.sided[[4]]` probability of observing a result at least this extreme. Therefore, we fail to reject the null hypothesis that the mean life expectancy is 60 years.
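As a sanity check on that number (a sketch on my part, assuming `my_t.test` implements the standard one-sample t-test), the two-sided p-value can be reproduced with base R:

```{r}
# manual two-sided p-value, assuming a standard one-sample t-test
n <- length(life_expectancy)
t_stat <- (mean(life_expectancy) - 60) / (sd(life_expectancy) / sqrt(n))
# probability of a t-statistic at least this extreme in either tail
2 * pt(-abs(t_stat), df = n - 1)
```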
```{r}
# store output of my_t.test for the one-sided test (less)
test_less <- my_t.test(x = life_expectancy, alternative = "less", mu = 60)
test_less
```
The p-value for the test with $H_a: \mu < 60$ is `r test_less[[4]]`, which is less than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is only a `r test_less[[4]]` probability of observing a result at least this extreme. Therefore, we reject the null hypothesis in favor of the alternative hypothesis: the true mean life expectancy is less than 60 years.
```{r}
# store output of my_t.test for the one-sided test (greater)
test_greater <- my_t.test(x = life_expectancy, alternative = "greater", mu = 60)
test_greater
```
The p-value for the test with $H_a: \mu > 60$ is `r test_greater[[4]]`, which is greater than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is a `r test_greater[[4]]` probability of observing a result at least this extreme. Therefore, we fail to reject the null hypothesis that the mean life expectancy is 60 years.
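As a further check (again assuming `my_t.test` follows the standard procedure), base R's `stats::t.test` should return matching p-values for all three alternatives:

```{r}
# p-values from base R's t.test for the same three alternatives
sapply(c("two.sided", "less", "greater"), function(alt) {
  t.test(life_expectancy, alternative = alt, mu = 60)$p.value
})
```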
To demonstrate the use of `my_lm`, I will fit a regression using life expectancy as the response variable and GDP per capita and continent as the explanatory variables, using data from `my_gapminder`.
```{r}
# store output of my_lm using gapminder data
my_model <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_model
```
The coefficient for `gdpPercap` is `r my_model[2, 1]`, which means that, holding continent fixed, each one-unit increase in `gdpPercap` is associated with an increase of `r my_model[2, 1]` years in `lifeExp`.
The hypothesis test for the `gdpPercap` coefficient is $H_0: \beta_{\text{gdpPercap}} = 0$ versus $H_a: \beta_{\text{gdpPercap}} \neq 0$. We can use the value from the `Pr(>|t|)` column for this test. The p-value, `r my_model[2, 4]`, is less than $\alpha = 0.05$; therefore, we reject the null hypothesis in favor of the alternative hypothesis, suggesting that the true change in life expectancy for each one-unit increase in GDP per capita is not zero.
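To verify these numbers outside the package, the same row can be pulled from base R's `lm()`. This is a cross-check sketch that assumes `my_lm` lays out its coefficient table like `summary.lm`, as the `Pr(>|t|)` column suggests:

```{r}
# cross-check the gdpPercap row against base R's lm()
base_fit <- lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
summary(base_fit)$coefficients["gdpPercap", ]
```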
The following plot compares the actual vs. fitted values:
```{r}
library(ggplot2)
# design matrix for the fitted model
my_x <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# fitted values
fitted <- my_x %*% as.matrix(my_model[, 1])
# data frame with actual and fitted values
my_df <- data.frame(actual = my_gapminder$lifeExp,
                    fitted = fitted,
                    continent = my_gapminder$continent)
# plot actual vs. fitted values
ggplot(my_df, aes(x = actual, y = fitted, color = continent)) +
  geom_point(size = 0.7) +
  theme_bw(base_size = 12) +
  labs(x = "Actual (years)", y = "Fitted (years)", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # match the axis ranges
  xlim(20, 90) +
  ylim(20, 90)
```
The fitted values have a much narrower range than the actual values. This indicates that the model fit is not very good: if the fit were good, we would expect the points to cluster along the diagonal line $y = x$ rather than forming flat, narrow bands.
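One rough way to put a number on this (a quick sketch, not part of the package) is the squared correlation between the actual and fitted values, which for a linear model with an intercept equals the model's $R^2$:

```{r}
# squared correlation between actual and fitted values (the model's R^2)
cor(my_df$actual, my_df$fitted)^2
```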
To demonstrate the use of `my_knn_cv`, I will predict the species (class) of penguins using the covariates bill length, bill depth, flipper length, and body mass, and will use 5-fold cross-validation to assess the predictions.
```{r}
library(magrittr)
library(dplyr)
# remove NA values
penguin <- na.omit(my_penguins)
# select covariates
my_train <- penguin %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)
# true species (class) of penguins
my_cl <- penguin %>% dplyr::pull(species)
# variable to store training error
training_err <- matrix(NA, 10, 1)
# variable to store misclassification error
CV_misclassification <- matrix(NA, 10, 1)
for (i in 1:10) {
  # predict species using 1 through 10 nearest neighbors and 5-fold cross-validation
  my_knn <- my_knn_cv(my_train, my_cl, i, 5)
  # add training error for each iteration
  training_err[i] <- length(which(penguin$species != my_knn[[1]])) / nrow(penguin)
  # add misclassification error for each iteration
  CV_misclassification[i] <- my_knn[[2]]
}
training_err
CV_misclassification
```
Based on the training misclassification rates, I would choose the model that uses 1 nearest neighbor because the training error rate is `r training_err[1]`. Based on the CV misclassification rates, I would also choose the 1-nearest-neighbor model because its error rate is the lowest. While both errors lead to the same choice, I would use the CV misclassification error to justify it. The training misclassification error for 1 nearest neighbor is always 0, since the first nearest neighbor of a data point is the point itself. The CV misclassification rate instead makes predictions on data withheld from the training set and compares them against the true classes of the withheld data. This is more reliable because we are not using the data point itself to predict its own class.
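Rather than reading the winner off the printed vectors, the choice can be confirmed programmatically:

```{r}
# number of neighbors with the lowest CV misclassification rate, and that rate
which.min(CV_misclassification)
min(CV_misclassification)
```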
To demonstrate the use of `my_rf_cv`, I will predict the body mass of penguins using bill length, bill depth, and flipper length as covariates.
```{r}
# matrix to store random forest cross-validation output
my_rf <- matrix(NA, 30, 3)
# iterate through k = 2, k = 5, k = 10
for (i in c(2, 5, 10)) {
  # repeat each setting 30 times
  for (j in 1:30) {
    if (i == 2) {
      # add to column for k = 2
      my_rf[j, 1] <- my_rf_cv(i)
    } else if (i == 5) {
      # add to column for k = 5
      my_rf[j, 2] <- my_rf_cv(i)
    } else {
      # add to column for k = 10
      my_rf[j, 3] <- my_rf_cv(i)
    }
  }
}
```
```{r}
library(reshape2)
library(ggplot2)
# store my_rf as a data frame
my_df <- data.frame(my_rf)
# name columns
colnames(my_df) <- c("k = 2", "k = 5", "k = 10")
# reshape to long format: one row per (k, MSE) pair
my_df <- melt(my_df, measure.vars = c("k = 2", "k = 5", "k = 10"))
# create a boxplot for each value of k
ggplot(data = my_df, aes(x = variable, y = value)) +
  geom_boxplot() +
  theme_bw(base_size = 12) +
  labs(x = "Number of Folds", y = "Average MSE")
```
```{r}
# create table to store CV MSE and SD
my_table <- matrix(NA, 3, 2)
my_df2 <- data.frame(my_rf)
# average CV MSE for each value of k
my_means <- colMeans(my_df2)
# standard deviation for each value of k
my_sd <- apply(my_df2, 2, sd)
# add average CV MSE to table
my_table[, 1] <- my_means
# add standard deviation to table
my_table[, 2] <- my_sd
# add column names
colnames(my_table) <- c("Avg MSE", "SD")
# add row names
row.names(my_table) <- c("k = 2", "k = 5", "k = 10")
my_table
```
As the number of folds increases, both the average mean squared error of the predictions and its standard deviation decrease. This suggests that the 10-fold random forest cross-validation model has the best predictive performance, as its average mean squared error and standard deviation are lower than those of the 2- and 5-fold models. In general, as $k$ increases there is a risk of overfitting, because more of the full data set is used to make the predictions and less is withheld to test their accuracy. In this case, I believe the 10-fold model withholds enough data to assess the model accurately and yields the best predictions compared to the other models.