```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
The package `project3package` contains four functions. `my_t.test` can be used to make statistical inferences, and `my_lm` can be used to make predictions as well as inferences. `my_knn_cv` and `my_rf_cv` use cross-validation to assess the predictive performance of models whose predictions are made using k-nearest neighbors and random forests, respectively.
To install `project3package` from GitHub, run the following code in your console:

```{r, eval = FALSE}
install.packages("devtools")
devtools::install_github("anikalindley/project3package", build_vignettes = TRUE, build_opts = c())
```
```{r}
library(project3package)
```
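With the package loaded, the four functions are called as follows. This is only a preview, with arguments passed positionally exactly as in the worked examples below (`my_train` and `my_cl` are constructed in the `my_knn_cv` section):

```{r, eval = FALSE}
# one-sample t-test of H_0: mu = 60 on a numeric vector
my_t.test(x = my_gapminder$lifeExp, alternative = "two.sided", mu = 60)
# linear regression with a formula and a data frame
my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
# k-nearest-neighbor CV: training covariates, true classes, neighbors, folds
my_knn_cv(my_train, my_cl, 1, 5)
# random forest CV: number of folds
my_rf_cv(5)
```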
To demonstrate the use of `my_t.test`, I will examine the mean life expectancy from the `my_gapminder` dataset. For all three tests, the null hypothesis is $H_0: \mu = 60$, and I will use `my_t.test` to assess the evidence that the mean life expectancy is not 60 years, is less than 60 years, or is greater than 60 years.
```{r}
# load gapminder data
data("my_gapminder")
# store life expectancy in the life_expectancy variable
life_expectancy <- my_gapminder$lifeExp
```
```{r}
# store my_t.test output for the two-sided test
test_two.sided <- my_t.test(x = life_expectancy, alternative = "two.sided", mu = 60)
test_two.sided
```
The p-value for the test with $H_a: \mu \neq 60$ is `r test_two.sided[[4]]`, which is greater than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is a `r test_two.sided[[4]]` probability of observing a result at least this extreme. Therefore, we fail to reject the null hypothesis that the mean life expectancy is 60 years.
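As a sanity check on that number (a sketch on my part, assuming `my_t.test` implements the standard one-sample t-test), the two-sided p-value can be reproduced with base R:

```{r}
# manual two-sided p-value, assuming a standard one-sample t-test
n <- length(life_expectancy)
t_stat <- (mean(life_expectancy) - 60) / (sd(life_expectancy) / sqrt(n))
# probability of a t-statistic at least this extreme in either tail
2 * pt(-abs(t_stat), df = n - 1)
```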
```{r}
# store output of my_t.test for the one-sided test (less)
test_less <- my_t.test(x = life_expectancy, alternative = "less", mu = 60)
test_less
```
The p-value for the test with $H_a: \mu < 60$ is `r test_less[[4]]`, which is less than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is only a `r test_less[[4]]` probability of observing a result at least this extreme. Therefore, we reject the null hypothesis in favor of the alternative hypothesis: the true mean life expectancy is less than 60 years.
```{r}
# store output of my_t.test for the one-sided test (greater)
test_greater <- my_t.test(x = life_expectancy, alternative = "greater", mu = 60)
test_greater
```
The p-value for the test with $H_a: \mu > 60$ is `r test_greater[[4]]`, which is greater than $\alpha = 0.05$. This means that, assuming $H_0: \mu = 60$ is true, there is a `r test_greater[[4]]` probability of observing a result at least this extreme. Therefore, we fail to reject the null hypothesis that the mean life expectancy is 60 years.
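As a further check (again assuming `my_t.test` follows the standard procedure), base R's `stats::t.test` should return matching p-values for all three alternatives:

```{r}
# p-values from base R's t.test for the same three alternatives
sapply(c("two.sided", "less", "greater"), function(alt) {
  t.test(life_expectancy, alternative = alt, mu = 60)$p.value
})
```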
To demonstrate the use of `my_lm`, I will fit a regression using life expectancy as the response variable and GDP per capita and continent as the explanatory variables, using data from `my_gapminder`.
```{r}
# store output of my_lm using gapminder data
my_model <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_model
```
The coefficient for `gdpPercap` is `r my_model[2, 1]`, which means that, holding continent fixed, each one-unit increase in `gdpPercap` is associated with an increase of `r my_model[2, 1]` years in `lifeExp`.
The hypothesis test for the `gdpPercap` coefficient is $H_0: \beta_{\text{gdpPercap}} = 0$ versus $H_a: \beta_{\text{gdpPercap}} \neq 0$. We can use the value from the `Pr(>|t|)` column for this test. The p-value, `r my_model[2, 4]`, is less than $\alpha = 0.05$; therefore, we reject the null hypothesis in favor of the alternative hypothesis, suggesting that the true change in life expectancy for each one-unit increase in GDP per capita is not zero.
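To verify these numbers outside the package, the same row can be pulled from base R's `lm()`. This is a cross-check sketch that assumes `my_lm` lays out its coefficient table like `summary.lm`, as the `Pr(>|t|)` column suggests:

```{r}
# cross-check the gdpPercap row against base R's lm()
base_fit <- lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
summary(base_fit)$coefficients["gdpPercap", ]
```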
The following plot compares the actual vs. fitted values:
```{r}
library(ggplot2)
# design matrix for the fitted model
my_x <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# fitted values
fitted <- my_x %*% as.matrix(my_model[, 1])
# data frame with actual and fitted values
my_df <- data.frame(actual = my_gapminder$lifeExp,
                    fitted = fitted,
                    continent = my_gapminder$continent)
# plot actual vs. fitted values
ggplot(my_df, aes(x = actual, y = fitted, color = continent)) +
  geom_point(size = 0.7) +
  theme_bw(base_size = 12) +
  labs(x = "Actual (years)", y = "Fitted (years)", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5)) +
  # match the axis ranges
  xlim(20, 90) +
  ylim(20, 90)
```
The fitted values have a much narrower range than the actual values. This indicates that the model fit is not very good: if the fit were good, we would expect the points to cluster along the diagonal line $y = x$ rather than forming flat, narrow bands.
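One rough way to put a number on this (a quick sketch, not part of the package) is the squared correlation between the actual and fitted values, which for a linear model with an intercept equals the model's $R^2$:

```{r}
# squared correlation between actual and fitted values (the model's R^2)
cor(my_df$actual, my_df$fitted)^2
```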
To demonstrate the use of `my_knn_cv`, I will predict the species (class) of penguins using the covariates bill length, bill depth, flipper length, and body mass, and will use 5-fold cross-validation to assess the predictions.
```{r}
library(magrittr)
library(dplyr)
# remove NA values
penguin <- na.omit(my_penguins)
# select covariates
my_train <- penguin %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)
# true species (class) of penguins
my_cl <- penguin %>% dplyr::pull(species)
# variable to store training error
training_err <- matrix(NA, 10, 1)
# variable to store misclassification error
CV_misclassification <- matrix(NA, 10, 1)
for (i in 1:10) {
  # predict species using 1 through 10 nearest neighbors and 5-fold cross-validation
  my_knn <- my_knn_cv(my_train, my_cl, i, 5)
  # add training error for each iteration
  training_err[i] <- length(which(penguin$species != my_knn[[1]])) / nrow(penguin)
  # add misclassification error for each iteration
  CV_misclassification[i] <- my_knn[[2]]
}
training_err
CV_misclassification
```
Based on the training misclassification rates, I would choose the model that uses 1 nearest neighbor because the training error rate is `r training_err[1]`. Based on the CV misclassification rates, I would also choose the 1-nearest-neighbor model because its error rate is the lowest. While both errors lead to the same choice, I would use the CV misclassification error to justify it. The training misclassification error for 1 nearest neighbor is always 0, since the first nearest neighbor of a data point is the point itself. The CV misclassification rate instead makes predictions on data withheld from the training set and compares them against the true classes of the withheld data. This is more reliable because we are not using the data point itself to predict its own class.
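Rather than reading the winner off the printed vectors, the choice can be confirmed programmatically:

```{r}
# number of neighbors with the lowest CV misclassification rate, and that rate
which.min(CV_misclassification)
min(CV_misclassification)
```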
To demonstrate the use of `my_rf_cv`, I will predict the body mass of penguins using bill length, bill depth, and flipper length as covariates.
```{r}
# matrix to store random forest cross-validation output
my_rf <- matrix(NA, 30, 3)
# iterate through k = 2, k = 5, k = 10
for (i in c(2, 5, 10)) {
  # repeat each setting 30 times
  for (j in 1:30) {
    if (i == 2) {
      # add to column for k = 2
      my_rf[j, 1] <- my_rf_cv(i)
    } else if (i == 5) {
      # add to column for k = 5
      my_rf[j, 2] <- my_rf_cv(i)
    } else {
      # add to column for k = 10
      my_rf[j, 3] <- my_rf_cv(i)
    }
  }
}
```
```{r}
library(reshape2)
library(ggplot2)
# store my_rf as a data frame
my_df <- data.frame(my_rf)
# name columns
colnames(my_df) <- c("k = 2", "k = 5", "k = 10")
# reshape to long format: one row per (k, MSE) pair
my_df <- melt(my_df, measure.vars = c("k = 2", "k = 5", "k = 10"))
# create a boxplot for each value of k
ggplot(data = my_df, aes(x = variable, y = value)) +
  geom_boxplot() +
  theme_bw(base_size = 12) +
  labs(x = "Number of Folds", y = "Average MSE")
```
```{r}
# create table to store CV MSE and SD
my_table <- matrix(NA, 3, 2)
my_df2 <- data.frame(my_rf)
# average CV MSE for each value of k
my_means <- colMeans(my_df2)
# standard deviation for each value of k
my_sd <- apply(my_df2, 2, sd)
# add average CV MSE to table
my_table[, 1] <- my_means
# add standard deviation to table
my_table[, 2] <- my_sd
# add column names
colnames(my_table) <- c("Avg MSE", "SD")
# add row names
row.names(my_table) <- c("k = 2", "k = 5", "k = 10")
my_table
```
As the number of folds increases, both the average mean squared error of the predictions and its standard deviation decrease. This suggests that the 10-fold random forest cross-validation model has the best predictive performance, as its average mean squared error and standard deviation are lower than those of the 2- and 5-fold models. In general, as $k$ increases there is a risk of overfitting, because more of the full data set is used to make the predictions and less is withheld to test their accuracy. In this case, I believe the 10-fold model withholds enough data to assess the model accurately and yields the best predictions compared to the other models.