knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

PROJECT3PACKAGE primarily contains functions that can be used for statistical inference and prediction. Functions include my_t.test for performing t-tests, my_lm for building linear models, my_knn_cv for performing the k-nearest neighbors algorithm with k-fold cross-validation, and my_rf_cv for performing the random forest algorithm with k-fold cross-validation. Other functions include f_to_c for converting temperatures from Fahrenheit to Celsius and the all-important my_pow, which computes exponents.
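
As a quick illustration of the two utility functions (a small sketch; my_pow is assumed to take the base first and the exponent second):

f_to_c(212)   # 212 degrees Fahrenheit is 100 degrees Celsius
my_pow(2, 10) # 2 raised to the 10th power should return 1024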

Installation

You can install PROJECT3PACKAGE using the following lines:

install.packages("devtools")
devtools::install_github("seangrimm/PROJECT3PACKAGE", build_vignettes = TRUE, build_opts = c())

Afterwards, functions from PROJECT3PACKAGE can be accessed after running the following:

library(PROJECT3PACKAGE)

my_t.test

The following shows example uses of my_t.test. Using the my_gapminder data set, we find that the mean life expectancy in this data set is about 59.47 years. We will conduct three t-tests to assess the null hypothesis $H_{0}: \mu = 60$ at a significance level $\alpha = 0.05$.
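
For reference, the one-sample t-test behind these calls uses the test statistic $t = \frac{\bar{x} - \mu_{0}}{s / \sqrt{n}}$, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $\mu_{0} = 60$ is the hypothesized mean; the p-value comes from a t-distribution with $n - 1$ degrees of freedom, with the tail(s) determined by the alternative hypothesis.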

data("my_gapminder")
life_expectancy <- my_gapminder[["lifeExp"]]
mean(life_expectancy)

alternative = "two.sided"

my_t.test(life_expectancy, "two.sided", 60)

Using $H_{a}: \mu \neq 60$, we see that the returned p-value is about 0.093. Since we used alternative = "two.sided", this means that, assuming the true mean life expectancy is indeed 60 years, there is a probability of 0.093 that another sample of the same size taken under the same conditions would have a mean life expectancy at least as extreme (in either direction relative to 60) as that of our data set. Using our significance level $\alpha = 0.05$, we fail to reject the null hypothesis. As such, we cannot conclude that the true mean life expectancy differs from 60.
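
As a quick sanity check outside the package, base R's t.test should report essentially the same two-sided p-value:

t.test(life_expectancy, alternative = "two.sided", mu = 60)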

alternative = "less"

my_t.test(life_expectancy, "less", 60)

Using $H_{a}: \mu < 60$ yields a p-value of about 0.046. In this instance, this means that, assuming the true mean life expectancy is 60 years, there is a probability of 0.046 that the mean life expectancy of another sample of the same size taken under the same conditions would be less than or equal to the mean life expectancy of our data. Since the p-value is less than our significance level $\alpha = 0.05$, we reject the null hypothesis in this case and conclude that the true mean life expectancy is less than 60.

alternative = "greater"

my_t.test(life_expectancy, "greater", 60)

Lastly, using $H_{a}: \mu > 60$ returns a p-value of about 0.953. This means that, assuming the true mean life expectancy really is 60 years, there is a probability of 0.953 that the mean life expectancy of another sample of the same size taken under the same conditions would be greater than or equal to the mean life expectancy of our data. This is far larger than our significance level $\alpha = 0.05$, so we fail to reject the null hypothesis. Once again, we cannot conclude that the true mean life expectancy differs from 60.
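
These three results are consistent with one another: for a continuous test statistic, the "less" and "greater" p-values sum to 1 (here $0.046 + 0.953 \approx 1$, up to rounding), and the two-sided p-value is twice the smaller one-sided p-value ($2 \times 0.046 \approx 0.093$).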

my_lm

The following demonstrates the use of my_lm. In this demonstration, we will once again be using the my_gapminder data set, wherein we will create a linear model to predict life expectancy per country based on GDP per capita and continent.

my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)

The produced linear model suggests that whenever the GDP per capita for any given country increases by 1, the predicted life expectancy increases by about 0.0004 years, as seen in the Estimate column. The Estimate values for the continent rows indicate the predicted difference in life expectancy for countries in those continents relative to the baseline continent (Africa, the first factor level).

The values in the Std. Error, t value, and Pr(>|t|) columns are associated with a two-sided t-test conducted on the coefficients $\beta_{j}$ used in our linear model. For example, if we take $\beta_{1}$ to be the coefficient for GDP per capita, then the t-test on $\beta_{1}$ is based on the null hypothesis $H_{0}: \beta_{1} = 0$ and the alternative hypothesis $H_{a}: \beta_{1} \neq 0$. The Pr(>|t|) column then gives the resulting p-value. For $\beta_{1}$, we see that the p-value is extremely small (about $8.55 \times 10^{-73}$). Using a significance level $\alpha = 0.05$, we would easily reject $H_{0}$. Thus, we are reasonably confident that GDP per capita does indeed have some effect on life expectancy.
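
The same arithmetic that fills the t value and Pr(>|t|) columns can be reproduced with base R's lm (used here only as a cross-check; its coefficients match those from my_lm):

fit <- lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
coefs <- summary(fit)$coefficients
# t value = Estimate / Std. Error; two-sided p-value from a t-distribution
t_value <- coefs["gdpPercap", "Estimate"] / coefs["gdpPercap", "Std. Error"]
p_value <- 2 * pt(abs(t_value), df = fit$df.residual, lower.tail = FALSE)
c(t_value = t_value, p_value = p_value)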

library(ggplot2)
model_results <- predict(lm(lifeExp ~ gdpPercap + continent, my_gapminder))
compare_results <- data.frame(actual = my_gapminder[["lifeExp"]],
                              predicted = model_results)

ggplot(data = compare_results, aes(x = actual, y = predicted)) +
  geom_point(size = .1) +
  labs(title = "Actual vs. Predicted", x = "actual (years)",
       y = "predicted (years)") +
  theme_bw() +
  scale_x_continuous(limits = c(0, 110), breaks = seq(0, 110, 10)) +
  scale_y_continuous(limits = c(0, 110), breaks = seq(0, 110, 10))

The plot above shows the difference between the actual life expectancy shown in my_gapminder and the predicted life expectancy from the fitted linear model (base R's lm is used here to generate predictions, but its coefficients match those returned by my_lm). Based on the plot, we can see that while my_gapminder shows a minimum life expectancy of about 23 years, the minimum life expectancy predicted by our model ends up being closer to 50 years. This is clearly due to the calculated y-intercept, which suggests that minimum life expectancy, regardless of country, would be more than 47 years. Many of the predictions made by the model are overly high as a result. On the other hand, we can also see that the direction of the model is mostly correct, especially for countries with a higher life expectancy. However, the model's fit is overall rather poor, which suggests that gdpPercap and continent are perhaps not the best indicators of life expectancy.
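
To make this compression of the predictions concrete, we can compare the ranges of the observed and fitted values (reusing the objects created above):

range(life_expectancy)  # observed life expectancies in my_gapminder
range(model_results)    # fitted values from the linear model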

my_knn_cv

Next, we will demonstrate the use of my_knn_cv using the my_penguins data set. In this demonstration, we will attempt to predict species using bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. Ten different models will be created using 1 through 10 neighbors in the k-nearest neighbors algorithm. All models will use 5-fold cross-validation.

# split the data into predictors (bill/flipper measurements, body mass) and true class labels
train <- na.omit(my_penguins)[, 3:6]
cl <- na.omit(my_penguins)[1]

# build models, acquire predictions
knn_1 <- my_knn_cv(train, cl, 1, 5)
knn_2 <- my_knn_cv(train, cl, 2, 5)
knn_3 <- my_knn_cv(train, cl, 3, 5)
knn_4 <- my_knn_cv(train, cl, 4, 5)
knn_5 <- my_knn_cv(train, cl, 5, 5)
knn_6 <- my_knn_cv(train, cl, 6, 5)
knn_7 <- my_knn_cv(train, cl, 7, 5)
knn_8 <- my_knn_cv(train, cl, 8, 5)
knn_9 <- my_knn_cv(train, cl, 9, 5)
knn_10 <- my_knn_cv(train, cl, 10, 5)
knn_vect <- c(knn_1, knn_2, knn_3, knn_4, knn_5, knn_6, knn_7, knn_8, knn_9,
              knn_10)

# compile error data
miss_rate <- rep(0, 10)
cv_error <- rep(0, 10)

for (i in 1:10) {
  # odd elements of knn_vect hold the class predictions, even elements the CV errors
  miss_rate[i] <- mean(cl[[1]] != knn_vect[[(2 * i) - 1]])
  cv_error[i] <- knn_vect[[2 * i]]
}

error_table <- matrix(c(miss_rate, cv_error), nrow = 10, ncol = 2)
rownames(error_table) <- seq(1, 10)
colnames(error_table) <- c("Training misclassification",
                           "CV misclassification")
as.table(error_table)

The table above shows the change in training misclassification rate and cross-validation misclassification error using $k = 1,...,10$ neighbors for the k-nearest neighbors algorithm. Based on both the training misclassification rates and the cross-validation error, it would seem that using $k = 1$ in k-nearest neighbors yields the best model for predicting species based on bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g. However, in many situations, using $k = 1$ neighbors would likely produce an over-fitted model. As such, though I would keep the $k = 1$ model on hand to further test its effectiveness, I would choose a model that balances the above errors while also avoiding having an over-fitted model, such as $k = 4$.
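
If we only wanted the number of neighbors that minimizes the cross-validation error, it can be read straight off the vector computed above:

which.min(cv_error)  # the number of neighbors with the lowest CV misclassification rate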

Furthermore, we can be more confident in our chosen model because of the k-fold cross-validation. Here, using 5 folds, the training data is split into five "folds." Five models are then fit, each using one fold as the test set and the remaining four folds as the training set. This process gives us an estimate of each candidate model's performance, which allows us to choose the best of the several models produced.
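
The fold assignment itself is straightforward; the following is a minimal sketch of the idea (not the internals of my_knn_cv):

n_folds <- 5
# randomly assign each row of the training data to one of the five folds
fold <- sample(rep(1:n_folds, length.out = nrow(train)))
table(fold)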

my_rf_cv

Lastly, we will demonstrate the use of my_rf_cv, which takes data from my_penguins and builds a random forest model to predict body_mass_g using bill_length_mm, bill_depth_mm, and flipper_length_mm. The function returns the cross-validation estimate of prediction error from k-fold cross-validation, where k is supplied as the function's argument.

set.seed(302)

cv_ests <- data.frame(rep(0, 90), rep(0, 90))
colnames(cv_ests) <- c("k", "cv_est")

for (i in 1:90) {
  if (i < 31) {
    cv_ests[i, 1] <- "2"
    cv_ests[i, 2] <- my_rf_cv(2)
  } else if (i < 61) {
    cv_ests[i, 1] <- "5"
    cv_ests[i, 2] <- my_rf_cv(5)
  } else {
    cv_ests[i, 1] <- "10"
    cv_ests[i, 2] <- my_rf_cv(10)
  }
}

cv_ests$k <- factor(cv_ests$k, levels = c("2", "5", "10"))

ggplot(data = cv_ests, aes(x = k, y = cv_est, group = k)) +
  geom_boxplot() +
  labs(title = "k-Fold Cross-Validation Estimate", x = "number of folds (k)",
       y = "CV estimate") +
  theme_bw()

vect_2 <- c(mean(cv_ests[which(cv_ests$k == "2"), ]$cv_est),
            sd(cv_ests[which(cv_ests$k == "2"), ]$cv_est))
vect_5 <- c(mean(cv_ests[which(cv_ests$k == "5"), ]$cv_est),
            sd(cv_ests[which(cv_ests$k == "5"), ]$cv_est))
vect_10 <- c(mean(cv_ests[which(cv_ests$k == "10"), ]$cv_est),
             sd(cv_ests[which(cv_ests$k == "10"), ]$cv_est))

cv_table <- matrix(c(vect_2, vect_5, vect_10), nrow = 3, ncol = 2, byrow = TRUE)
rownames(cv_table) <- c("k = 2", "k = 5", "k = 10")
colnames(cv_table) <- c("Mean", "SD")
as.table(cv_table)

From both the boxplots and the table above, we can see that the average cross-validation estimate decreases as we increase the number of folds used in k-fold cross-validation. Additionally, increasing the number of folds also leads to much less variance in the cross-validation estimate. This is likely due to the fact that increasing the number of folds also increases the size of the training set. This potentially improves the model's ability to make accurate predictions when it comes to the testing set, thus decreasing the cross-validation estimate. However, as overfitting our model with too much training data is certainly still a danger, this will not necessarily be the case every time. This perhaps explains why some fold separations, such as the one above, yield a higher cross-validation estimate for 10-fold cross-validation in comparison to 5-fold cross-validation.
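
Concretely, with $k$ folds each model is trained on a fraction $(k - 1)/k$ of the data: one half for $k = 2$, four fifths for $k = 5$, and nine tenths for $k = 10$, so larger $k$ means more data per training set while each test fold becomes smaller.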


