knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(project3package)

Introduction

This package (project3package) contains four functions and one data set. The functions are my_t_test, my_lm, my_knn_cv, and my_rf_cv. The function my_t_test performs a one sample t-test, my_lm fits a linear model, my_knn_cv predicts an output class using covariates from the same data set, and my_rf_cv predicts a variable using covariates from the same data set. The data set is my_gapminder, which includes data from the gapminder package.

To install this package, one can follow the instructions below (and displayed on https://github.com/arielle-min/project3package).

To summarize, you want to make a call to devtools to install github using install_github() from my github, which is "arielle-min/project3package". After this, one can call library(project3package) to be able to use my package.

# installs package
devtools::install_github("arielle-min/project3package")
# access to library
library(project3package)
# loads necessary libraries
library(tidyverse)
library(palmerpenguins)
library(class)
library(matrixStats)

Tutorial for my_t_test

lifeExp_data <- my_gapminder$lifeExp

Two.Sided Hypothesis:

# runs function with two.sided alternative
my_t_test(lifeExp_data, "two.sided", 60)

The p-value is .09322877, which means that there's about a .09 probability of observing a test statistic as or more extreme than what we observed given that the null hypothesis is true. Since .09 is greater than our alpha level, .05, this means we fail to reject the null hypothesis. There is not enough data to support that the mean life expectancy is not equal to 60.

Less Hypothesis:

# runs function with less alternative
my_t_test(lifeExp_data, "less", 60)

The p-value is .04661438, which means that there's about a .047 probability of observing a test statistic as or more extreme than what we observed given that the null hypothesis is true. Since .047 is less than our alpha level, .05, this means we reject the null hypothesis. This data supports that the mean life expectancy is less than 60.

Greater Hypothesis:

# runs function with greater alternative
my_t_test(lifeExp_data, "greater", 60)

The p-value is .9533856, which means that there's about a .95 probability of observing a test statistic as or more extreme than what we observed given that the null hypothesis is true. Since .95 is greater than our alpha level, .05, this means we fail to reject the null. There is not enough data to support that the mean life expectancy is greater to 60.

Tutorial for my_lm

Linear Model:

# runs function
lifeExp_lm <- my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)
print(lifeExp_lm)

The coefficient (β) for gdpPercap is .0004452704, which is about 0. So the expected difference in life expectancy between two observations differing by one unit in gdp Per cap with all other covariates identical is about 0. This means that gdp Pe rcap has no impact on life expectancy.

Ho: β = 0

Ha: β ≠ 0

The p-value is 8.552893e-73, which is about 0 and means that there's about a 0 probability of observing a test statistic as or more extreme than what we observed given that the null hypothesis is true. Since 0 is less than our alpha level, .05, this means we reject the null hypothesis. This data supports that the the coefficient for gdpPercap is not equal to 0.

Plotting Values:

# stores the model estimates
my_estimates <- as.matrix(lifeExp_lm[,"Estimate"])

# creates of matrix of data that will be used in the model
model_data <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder)

# multiples the matrix data to get the predicted output
lifeExp_predict <- model_data %*% my_estimates

# creates data frame of actual and fitted values
my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = lifeExp_predict)

# creates then prints a graph of actual vs fitted values
lifeExp_plot <- ggplot(my_df, aes(x = fitted, y = actual)) +
                  geom_point() +
                  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
                  theme_bw(base_size = 16, base_family = "serif") +
                  labs(x = "Fitted values", y = "Actual values", 
                       title = "Actual vs. Fitted") +
                  theme(plot.title = element_text(hjust = 0.5))
print(lifeExp_plot)

With the model, we see that for the fitted values that are less than 65, there is very little accuracy with the model. However, for the fitted values that are between 65-85, there is much more accuracy. It seems that this model is only accurate for a small range of actual and fitted values. Overall, because the model is trying to predict values from the entire actual data set, I'd say that this model isn't very accurate and thus shouldn't be heavily relied upon.

Tutorial for my_knn_cv

# removes na values
penguins_omit <- na.omit(penguins) 

# selects specific columns for data frame
train <- penguins_omit %>% select(3, 4, 5, 6)

# creates empty matrices
cv_error <- matrix(NA, nrow = 10) 
train_error <- matrix(NA, nrow = 10) 

# runs function with a k_nn value from 1 to 10
for (i in 1:10) {
  cv_mis_value <- my_knn_cv(train, penguins_omit$species, k_nn = i, k_cv = 5)
  train_err_value <- knn(train, train, penguins_omit$species, k = i)
  train_err_value_avg <- 1 - mean(train_err_value == penguins_omit$species)
  cv_error[i] <- cv_mis_value$`cv error`
  train_error[i] <- train_err_value_avg
}

# produces a table 
error_table <- matrix(c(cv_error, train_error), ncol = 2)
colnames(error_table) <- c("cv error", "training error")
rownames(error_table) <- c(1:10)
error_table <- as.table(error_table)
print(error_table)

For training misclassification rates, I would choose the model with k_nn = 1 because the training error is the lowest. For CV misclassification rates, I would choose the model with k_nn = 1 because the misclassification rate is the lowest. (Due to randomness of splitting data into the folds, this current version may not have k_nn = 1 as having the lowest misclassification rate, but after running the function many different times, it would usually end up with k_nn = 1 having the lowest misclassification rate.)

In practice, I would choose the CV misclassification model because using training error model gives us the minimized error from our training data, but we can't generalize this error rate to our test data. When we use k_nn = 1, the model is essentially just copying over the input data (of 1 nearest neighbor) and using that to predict the output class. This is why the training error is 0, since the predicted output class the exact as the input data class. This is also why we can't generalize this error rate to the rest of our test data, since the output class with k_nn = 1 only really applies to one specific set of training data. However, with the cv misclassification model, we get an estimate of the test error which is the average of the misclassification error rates across the folds. Separating the data into folds allows us to use all but one fold of our data as training data and the remaining as test data. We make predictions using the test data and then repeat the process for all the folds. Within each fold we get a misclassification error rate, which allows us to get an average misclassification error rate for the whole data set. Since we get an average, this method is more generalizable, which is why it would be better in practice.

Tutorial for my_rf_cv

# creates empty matrices
cv_est_mse_2 <- matrix(NA, nrow = 30) 
cv_est_mse_5 <- matrix(NA, nrow = 30) 
cv_est_mse_10 <- matrix(NA, nrow = 30) 

# runs function 30 times
for (i in 1:30) {
  cv_muse_value <- my_rf_cv(2)
  cv_est_mse_2[i] <- cv_muse_value
}

# runs function 30 times
for (i in 1:30) {
  cv_muse_value <- my_rf_cv(5)
  cv_est_mse_5[i] <- cv_muse_value
}

# runs function 30 times
for (i in 1:30) {
  cv_muse_value <- my_rf_cv(10)
  cv_est_mse_10[i] <- cv_muse_value
}

# creates a data frame with values for each knn value
k_cv_est_mse <- data.frame("knn" = rep(c("2", "5", "10"), each = 30),
                           "cv_est_mse"  = c(cv_est_mse_2, cv_est_mse_5, cv_est_mse_10))

# creates then prints a graph displaying three boxplots
knn_mse_box <- ggplot(k_cv_est_mse, aes(x = knn, y = cv_est_mse)) + 
                  geom_boxplot(fill = "lightblue") +
                  labs(title = "CV estimated MSE based on knn", 
                       x = "knn", y = "CV estimated MSE") +
                  theme_bw(base_size = 16, base_family = "serif") +
                  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(knn_mse_box)

# creates a table displaying the mean and standard deviation
# for each of the knn values
cv_table <- matrix(c(colMeans(cv_est_mse_2), colSds(cv_est_mse_2), 
                     colMeans(cv_est_mse_5), colSds(cv_est_mse_5), 
                     colMeans(cv_est_mse_10), colSds(cv_est_mse_10)), 
                     ncol = 2, nrow = 3, byrow = TRUE)
colnames(cv_table) <- c("mean","standard deviation")
rownames(cv_table) <- c("knn = 2", "knn = 5", "knn = 10")
cv_table <- as.table(cv_table)
print(cv_table)

We see that for knn = 2, the boxplot shows that it has the largest range. The shape of the box itself is also rather long in length. The boxplots for knn = 5 and knn = 10 are much more similar looking. They both have a much smaller range and shorter boxes. It appears that the range for knn = 10 is the lowest. When we look at the table, we see that knn = 2, it has the largest mean and the largest standard deviation. knn = 10 has the lowest mean and the lowest standard deviation.

(Due to randomness of splitting data into the folds, this current version may not have the same results as the ones described above but after running the function many different times, it would typically end up with the same results as described above.)

When we increase our value for k nearest neighbors, we're increasing the number of surrounding data points that we use to predict the output. Since we're increasing the number, this provides a lower average cv estimate and lower standard error.



arielle-min/project3package documentation built on March 22, 2021, 1:47 p.m.