
This vignette is a compilation of various topics I have learned in Stat 302. It includes functions for hypothesis testing, linear modeling, and statistical prediction. You can use the following code to install project3package:

devtools::install_github("oliverlinehan41/project3package")
library(project3package)

One function in this package is my_t_test(). This function allows the user to perform a t-test on a set of data. First, I will demonstrate a two-sided t-test on the gapminder life expectancy data.

my_t_test(my_gapminder$lifeExp, "two.sided", 60)

The p-value from this test is ~ 0.09. This means that if the true average life expectancy were 60 years, around 9% of samples of the same size would have a sample mean at least as far from 60, in either direction, as the one we obtained. With an alpha cutoff of 0.05, this is not significant enough to reject the null hypothesis that the average life expectancy is 60 years.

Next, I will do a one-sided t-test on the same data, where the alternative hypothesis is that the average life expectancy is less than 60 years.

my_t_test(my_gapminder$lifeExp, "less", 60)

The p-value from this test is ~ 0.047. This means that if the true average life expectancy were 60 years, around 4.7% of samples of the same size would have a sample mean at least as low as the one we obtained. With an alpha cutoff of 0.05, this is significant enough to reject the null hypothesis that the average life expectancy is 60 years in favor of the alternative that it is lower.

Finally, I will do a one-sided t-test on the same data, where the alternative hypothesis is that the average life expectancy is greater than 60 years.

my_t_test(my_gapminder$lifeExp, "greater", 60)

The p-value from this test is ~ 0.95. This means that if the true average life expectancy were 60 years, around 95% of samples of the same size would have a sample mean at least as high as the one we obtained. With an alpha cutoff of 0.05, this is not sufficient evidence to reject the null hypothesis that the average life expectancy is 60 years.
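
As a sanity check, assuming my_t_test() implements the standard one-sample t-test, the same three p-values can be reproduced with base R's stats::t.test():

# Assumes my_t_test() matches the standard one-sample t-test
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)
t.test(my_gapminder$lifeExp, alternative = "less", mu = 60)
t.test(my_gapminder$lifeExp, alternative = "greater", mu = 60)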

Another function in project3package is my_lm(). This function allows you to build a linear model for a set of data. For example, the following code builds a linear model of how GDP per capita and continent relate to life expectancy in the gapminder data.

my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)

The "estimate" value is the linear coefficient for each variable. For gdpPercap, for example, the coefficient is 4.45 * 10^-4. This means that all else being equal, a $10,000 increase in GDP per capita would mean that our predicted life expectency would increase by 4.45 years. For a hypothesis test, our null hypothesis would be that GDP per capita and life expectency do not have a relationship, which means that our linear coefficient for gdpPercap would be 0. Our alternative hypothesis is that gdpPercap and life expectency do have a relationship. The p-value for the linear coefficient for gdpPercap is 8.55*10^-73, which is very small. This means that if our true coefficient is 0, only an astronomically tiny proportion of samples would have a more extreme coefficient, in either direction. With a cutoff of 0.05, we can conclude that there is evidence to support a relationship between GDP per capita and life expectancy.

To examine the effectiveness of the model, we can plot the fitted values for life expectancy against the actual values from the gapminder data.

library(ggplot2)
# Refit the model with base R's lm() to extract the fitted values
fit_mod <- fitted(lm(lifeExp ~ gdpPercap + continent, my_gapminder))
lm_df <- data.frame(actual = my_gapminder$lifeExp, fitted = fit_mod)
# Scatter fitted vs. actual, with a dashed y = x reference line
ggplot(lm_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "blue", lty = 2) +
  theme_bw() +
  labs(x = "Fitted Values", y = "Actual Values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))

As demonstrated by the deviation from a linear pattern between fitted and actual values, this model is not a good fit for the data. When the fitted life expectancy is between roughly 70 and 85 years, the model does reasonably well, but below 70 or above 85 it loses nearly all of its accuracy.
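
To put a number on the fit, we can check the model's R^2 and the correlation between fitted and actual values with base R (this check is not part of the package):

# Quantify the fit: R^2 from base R's lm() summary, and the correlation
# between fitted and actual life expectancy
summary(lm(lifeExp ~ gdpPercap + continent, my_gapminder))$r.squared
cor(lm_df$fitted, lm_df$actual)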

Another function in project3package is my_knn_cv(). This function performs cross-validation on training data to determine which value of k (the number of neighbors) gives the best k-nearest-neighbors predictive model. For cross-validation, the data is divided into a chosen number of folds (5 in the example below). Each fold is used as test data once, with the remaining folds used as training data, and the model predicts the class of the test observations from the training observations.

This is useful because merely relying on the data at hand to predict itself ends up being too optimistic: the k with the lowest training error will actually produce an overfit model, which describes the data well but is not as good at predicting new data. The k with the lowest CV error does not fit the data itself as well, but is more reliable when it comes to predicting new data.
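
The implementation of my_knn_cv() is not reproduced in this vignette, so here is a minimal sketch of the standard k-fold pattern such a function follows, assuming numeric predictors with no missing values; the argument names and the output structure (a list holding the full-data classification and the CV error) are assumptions based on how the function is called below, not the package's actual code.

# A sketch of the standard k-fold CV pattern for k-nearest-neighbors;
# argument names and output structure are hypothetical
my_knn_cv_sketch <- function(train, cl, k_nn, k_cv) {
  # Randomly assign each observation to one of k_cv folds
  folds <- sample(rep(1:k_cv, length.out = nrow(train)))
  cv_errs <- numeric(k_cv)
  for (i in 1:k_cv) {
    # Hold out fold i as test data; train on the remaining folds
    pred <- class::knn(train = train[folds != i, , drop = FALSE],
                       test = train[folds == i, , drop = FALSE],
                       cl = cl[folds != i], k = k_nn)
    # Misclassification rate on the held-out fold
    cv_errs[i] <- mean(pred != cl[folds == i])
  }
  # Classify the full data against itself for the fitted-class output
  class_full <- class::knn(train, train, cl, k = k_nn)
  list(class = class_full, cv_err = mean(cv_errs))
}

With that pattern in mind, the code below runs my_knn_cv() on the penguins data for k = 1 through k = 10, using 5 folds; the result is a vector of the cross-validation error for each value of k.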

library(class)
library(dplyr)
# Collect the 5-fold CV error for each k from 1 to 10
cverr <- vector(mode = "numeric", length = 10)
for (m in 1:10) {
  x <- my_knn_cv(my_penguins, my_penguins$species, m, 5)
  # The second element of the output is the CV misclassification error
  cverr[m] <- x[[2]]
}
cverr

The following code computes the training error, i.e., the error when we use the training data as its own test data.

train_err <- vector(mode = "numeric", length = 10)
for (q in 1:10) {
  nn <- my_knn_cv(my_penguins, my_penguins$species, q, 5)
  # The first element of the output is the predicted class for each
  # observation when the full data is used to classify itself
  nn_class <- nn[[1]]
  # Training error: proportion of observations misclassified
  train_err[q] <- mean(nn_class != my_penguins$species)
}
train_err

Based on the training error, I would choose k = 2, as it has the lowest training error. Using the CV error, however, I would choose k = 1, as it has the lowest CV error. In practice, I would choose k = 1: k = 2 fits the training data better, but it is overfit and therefore less robust than k = 1 when predicting new data. Since k-nearest-neighbors is used for prediction, it makes more sense to choose k = 1.
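
These choices can also be read off programmatically:

# k with the lowest error under each criterion
which.min(train_err)  # lowest training error (k = 2 above)
which.min(cverr)      # lowest CV error (k = 1 above)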

The final function in project3package is my_rf_cv(). This function is similar to my_knn_cv(), but it instead performs cross-validation with random forests on the palmerpenguins data. Below, I produce boxplots and a table to summarize the cross-validation standard errors for 3 values of k, where k is the number of folds used in cross-validation (2, 5, and 10).
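
The package's implementation of my_rf_cv() is not shown here, so below is a minimal sketch of what such a function plausibly does; the response and covariates (body_mass_g predicted from the bill and flipper measurements), the number of trees, and the error metric (average prediction error across folds) are all assumptions, not the package's actual code.

# A sketch of k-fold CV with random forests; variable choices, ntree,
# and the error metric are hypothetical
my_rf_cv_sketch <- function(k) {
  dat <- na.omit(my_penguins[, c("body_mass_g", "bill_length_mm",
                                 "bill_depth_mm", "flipper_length_mm")])
  # Randomly assign each observation to one of k folds
  folds <- sample(rep(1:k, length.out = nrow(dat)))
  mse <- numeric(k)
  for (i in 1:k) {
    # Train on all folds except fold i, then predict fold i
    model <- randomForest::randomForest(
      body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
      data = dat[folds != i, ], ntree = 100)
    pred <- predict(model, dat[folds == i, ])
    mse[i] <- mean((pred - dat$body_mass_g[folds == i])^2)
  }
  mean(mse)
}

With those caveats, the code below calls my_rf_cv() 30 times for each number of folds.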

library(randomForest)
# Run my_rf_cv() 30 times for each number of folds
cv_2 <- vector(mode = "numeric", length = 30)
cv_5 <- vector(mode = "numeric", length = 30)
cv_10 <- vector(mode = "numeric", length = 30)

for (m in 1:30) {
  cv_2[m] <- my_rf_cv(2)
  cv_5[m] <- my_rf_cv(5)
  cv_10[m] <- my_rf_cv(10)
}

rf_df <- data.frame(folds = c(rep("2", 30), rep("5", 30), rep("10", 30)),
                    mean_cv_se = c(cv_2, cv_5, cv_10))

ggplot(data = rf_df, aes(x = reorder(folds, -mean_cv_se), y = mean_cv_se)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  labs(title = "Mean Cross-Validation Standard Error by Number of Folds",
       x = "Number of Folds",
       y = "Mean Cross-Validation Standard Error") +
  theme(plot.title = element_text(hjust = 0.5, size = 8),
        axis.title.x = element_text(size = 10),
        axis.title.y = element_text(size = 10))

# Table of the mean and standard deviation of the CV standard errors
means <- c(mean(cv_2), mean(cv_5), mean(cv_10))
stdvs <- c(sd(cv_2), sd(cv_5), sd(cv_10))
rf_tab <- as.table(cbind(means, stdvs))
rownames(rf_tab) <- c("k=2", "k=5", "k=10")
colnames(rf_tab) <- c("mean", "sd")
rf_tab

According to the boxplot, with 2 folds the standard error is higher than with 5 or 10 folds, and the spread is much wider. 10 folds has the lowest standard error and the narrowest spread. This is also reflected in the table, where k=2 has the highest mean and highest standard deviation of the cross-validation standard error, and k=10 has the lowest of each. The higher standard deviation for k=2 is likely because with fewer folds there are fewer individual error components averaged together, so there is more variation between runs. The higher standard error itself is because with k=10 each model is trained on more of the data, so it fits the data better than with k=2.


