`mypackage` is a statistical package that includes several useful functions for statistical computing! It includes the functions `my_t.test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`. It also includes two datasets: `my_penguins` and `my_gapminder`.
To install `mypackage`, run the following lines:

```r
# install.packages("devtools")
devtools::install_github("kobesar/mypackage")
```
To load `mypackage`, run the following line:

```r
library(mypackage)
```
We can load `my_penguins` and `my_gapminder` with the following lines:

```r
data("my_penguins")
data("my_gapminder")
```
In this tutorial, we will also be using `ggplot2`, `dplyr`, and `class` to help demonstrate how to use the functions included in `mypackage`. To install and load these packages, run the following lines:

```r
# install.packages(c("ggplot2", "dplyr", "class"))
library(ggplot2)
library(dplyr)
library(class)
```
`my_t.test` performs a t-test given a numeric vector of values, a character string indicating the type of test, and a numeric value for the hypothesized mean. The alternative must be one of "two.sided", "less", or "greater"; otherwise, the function throws an error.
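For example, passing any other string should trigger that error. A minimal sketch (the `tryCatch()` wrapper just displays the error message instead of halting the session):

```r
# "larger" is not one of the three supported alternatives, so this should error
tryCatch(
  my_t.test(c(60.1, 59.2, 61.5), "larger", 60),
  error = function(e) message("Caught error: ", conditionMessage(e))
)
```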
We will use the `lifeExp` column from the `my_gapminder` data to apply a t-test. First, we perform a two-sided t-test where our alternative hypothesis states that the true mean is not equal to 60.
```r
x <- my_gapminder$lifeExp  # vector of values
alternative <- "two.sided" # type of alternative hypothesis
mu <- 60                   # hypothesized mean
my_t.test(x, alternative, mu)
```
We can see from the printed list that the p-value is larger than our $\alpha$ value of 0.05, so we fail to reject the null hypothesis that the true mean is equal to 60.
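As a quick sanity check, assuming `my_t.test` implements the standard one-sample t-test, base R's `t.test()` should report a matching p-value:

```r
# base R equivalent of the call above; the p-value should match
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)$p.value
```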
Then, we perform our t-test with the alternative hypothesis that the true mean is less than 60.

```r
x <- my_gapminder$lifeExp # vector of values
alternative <- "less"     # type of alternative hypothesis
mu <- 60                  # hypothesized mean
my_t.test(x, alternative, mu)
```
From the returned list, we observe that our p-value is much smaller than $\alpha$, so we reject the null hypothesis that the true mean is equal to 60 in favor of the alternative that it is less than 60.
Finally, we perform our t-test with the alternative hypothesis that the true mean is greater than 60.

```r
x <- my_gapminder$lifeExp # vector of values
alternative <- "greater"  # type of alternative hypothesis
mu <- 60                  # hypothesized mean
my_t.test(x, alternative, mu)
```
Since our p-value is far larger than an $\alpha$ value of 0.05, we fail to reject the null hypothesis that the true mean is equal to 60 against the alternative that it is greater than 60.
`my_lm` performs linear regression given a formula and a data frame. A call follows this format: `my_lm(dependent variable ~ independent variable 1 + independent variable 2 + ..., data)`.
Here, we will apply regression to the `lifeExp` variable from the `my_gapminder` data, using `gdpPercap` and `continent` as our predictors.
```r
# fit the linear model ("mod" rather than "lm", to avoid shadowing stats::lm)
mod <- my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)
mod # print out the model table
```
From the linear model table, we can observe the estimated coefficients and their associated statistics. The `gdpPercap` coefficient represents the change in `lifeExp` per unit increase in `gdpPercap`, holding `continent` fixed. A p-value is also computed for each coefficient, so each variable has an associated hypothesis test. The null hypothesis for the `gdpPercap` variable is that there is no relationship between `gdpPercap` and `lifeExp`. Since our p-value for `gdpPercap` is smaller than an $\alpha$ value of 0.05, we can reject that null hypothesis. This means that adding `gdpPercap` to this model is meaningful when predicting `lifeExp`, and that there is some meaningful relationship between `gdpPercap` and `lifeExp`.
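As a sanity check, assuming `my_lm` fits ordinary least squares, its coefficient estimates should match those from base R's `lm()`:

```r
# base R equivalent; these should match the first column of the my_lm table
coef(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))
```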
We can take this model and compute the fitted values. Then, to better understand how well the model did, we can plot the fitted values against the actual values.
```r
x <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder) # model matrix
est <- mod[, 1]     # coefficients from the linear model
fitted <- x %*% est # fitted values

# put the fitted and actual values into one data frame
fit_act <- data.frame(cbind(fitted, my_gapminder$lifeExp))
colnames(fit_act) <- c("fitted", "actual") # rename columns

# plot the fitted and actual values on a scatterplot
ggplot(fit_act) +
  geom_point(aes(fitted, actual)) +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2)
```
Since the points do not follow the 45-degree line (the red dashed line), our model does not seem to predict very well. In fact, the points are clustered by continent, as we can see gaps between the groups of points. Further, within these groups, the points run vertically rather than along the dashed red line, meaning we have fairly large positive and negative residuals. However, the model does seem to predict the cluster in the 65-80 range fairly well, as those points sit near the red dashed line and trend upward with it. In other words, the fitted values were very similar to the actual values in that range.
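To put a number on that vertical spread, we can summarize the residuals directly from the `fit_act` data frame built above:

```r
fit_act$resid <- fit_act$actual - fit_act$fitted # residual = actual - fitted
summary(fit_act$resid) # a wide range reflects the large residuals seen in the plot
```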
`my_knn_cv` performs k-nearest-neighbors cross-validation given a data frame of covariates, a vector of true class values, the number of neighbors, and the number of folds. A call follows this format: `my_knn_cv(data used to predict a column, column to predict, number of neighbors, number of folds)`.
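For instance, a single call might look like the sketch below (assuming, as in the loop that follows, that columns 3 through 6 of `my_penguins` hold the numeric covariates and that the returned list contains a `cv_error` element):

```r
peng <- na.omit(mypackage::my_penguins) # kNN cannot handle missing values
single_run <- my_knn_cv(peng[, 3:6], peng$species, k_nn = 5, k_cv = 5)
single_run$cv_error # estimated misclassification rate from 5-fold CV
```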
Here, we use the `my_penguins` dataset to predict our output class, `species`, with 5-fold cross-validation, recording the cross-validation and training errors for 1 to 10 nearest neighbors. To do this, we use a for loop to call our `my_knn_cv` function 10 times. But first, make sure to omit any `NA` observations!
```r
my_penguins <- na.omit(mypackage::my_penguins) # load data and omit NAs

cv_error <- c()    # initialize CV error vector
train_error <- c() # initialize training error vector

# loop over 1 to 10 neighbors
for (k_nn in 1:10) {
  # compute the CV error for k_nn neighbors
  cv_error[k_nn] <- my_knn_cv(my_penguins[, 3:6], my_penguins$species,
                              k_nn = k_nn, k_cv = 5)$cv_error
  # compute the training error for k_nn neighbors
  train_error[k_nn] <- mean(knn(my_penguins[, 3:6], my_penguins[, 3:6],
                                my_penguins$species, k_nn) != my_penguins$species)
}

# store the results in one data frame
error_df_knn <- data.frame(
  k_nn = 1:10,
  cv_error = cv_error,
  train_error = train_error
)
error_df_knn # print the data frame
```
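Since `error_df_knn` is already in a tidy format, we can also plot both error curves with the `ggplot2` package loaded earlier; a small sketch:

```r
# visualize CV and training error against the number of neighbors
ggplot(error_df_knn, aes(x = k_nn)) +
  geom_line(aes(y = cv_error, col = "CV error")) +
  geom_line(aes(y = train_error, col = "Training error")) +
  labs(x = "Number of neighbors (k_nn)", y = "Misclassification error", col = "")
```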
We use cross-validation to randomly split the data into folds that we can compare against each other; this way, we can gauge the performance of our models by analyzing their out-of-sample errors. In particular, cross-validation helps minimize overfitting, since each model is evaluated on data it was not trained on. Based on the CV error, I would choose k_nn = 1, since it yields the lowest error. However, based on training error, I would choose k_nn = 2, since choosing k_nn = 1 by training error is meaningless: the training error is always 0 when k_nn = 1, because each point is its own nearest neighbor. Thus, picking the next k_nn with the lowest error is the wiser choice. Overall, we would choose k_nn = 2, since it minimizes our risk of both underfitting and overfitting.
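To select these values programmatically rather than by eye, `which.min()` does the trick (dropping the degenerate `k_nn = 1` row when using the training error, for the reason above):

```r
error_df_knn$k_nn[which.min(error_df_knn$cv_error)] # best k_nn by CV error
# best k_nn by training error, excluding k_nn = 1
with(error_df_knn[-1, ], k_nn[which.min(train_error)])
```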
`my_rf_cv` performs random forest cross-validation on the `my_penguins` dataset, given a number of folds `k`, to predict the species of the penguins. It returns the mean squared error for the given number of folds.
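A single run is one line; note that because folds are assigned randomly, repeated calls return slightly different errors, which is why we repeat each setting 30 times below:

```r
my_rf_cv(5) # one 5-fold cross-validation run; results vary from run to run
```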
```r
k_cv <- c()     # initialize vector of k-fold values
cv_error <- c() # initialize CV error vector

# loop through k = 2, 5, and 10 folds
for (k in c(2, 5, 10)) {
  # repeat 30 times for each k
  for (i in 1:30) {
    cv_error <- append(cv_error, my_rf_cv(k)) # add recorded error to vector
    k_cv <- append(k_cv, k)                   # add k-fold to vector
  }
}

# put the results into one data frame
error_df_rf <- data.frame(
  k = k_cv,
  cv_error = as.numeric(cv_error)
)

# show a boxplot of the errors for each k
ggplot(error_df_rf) +
  geom_boxplot(aes(x = cv_error, y = as.factor(k))) +
  labs(x = "CV error", y = "k")

# a table of the mean and standard deviation of errors for each k
error_df_rf %>%
  group_by(k) %>%
  summarize(mean = mean(cv_error), sd = sd(cv_error))
```
We can see that the variance of the CV error decreases as we increase the value of k. This shows up in the decreasing standard deviations and the narrower boxplots. We see this difference because, as we use more folds, each model is trained on a larger share of the data, so the fitted models, and therefore their estimated errors, vary less from run to run.