`mypackage` is a statistical package that includes several useful functions for statistical computing! It includes the functions `my_t.test`, `my_lm`, `my_knn_cv`, and `my_rf_cv`. It also includes two datasets: `my_penguins` and `my_gapminder`.
To install `mypackage`, run the following lines:

```r
# install.packages("devtools")
devtools::install_github("kobesar/mypackage")
```
To load `mypackage`, run the following line:

```r
library(mypackage)
```
We can load `my_penguins` and `my_gapminder` with the following lines:

```r
data("my_penguins")
data("my_gapminder")
```
In this tutorial, we will also be using `ggplot2`, `dplyr`, and `class` to help demonstrate how to use the functions included in `mypackage`. To install and load these packages, run the following lines:

```r
# install.packages(c("ggplot2", "dplyr", "class"))
library(ggplot2)
library(dplyr)
library(class)
```
`my_t.test` performs a t-test given a numeric vector of values, a character string indicating the type of test, and a numeric value for the hypothesized mean. The alternative must be one of "two.sided", "less", or "greater"; otherwise, the function throws an error.
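For example, passing any other string should trigger that error. A minimal sketch (the `tryCatch()` wrapper just displays the error message instead of halting the session):

```r
# "larger" is not one of the three supported alternatives, so this should error
tryCatch(
  my_t.test(c(60.1, 59.2, 61.5), "larger", 60),
  error = function(e) message("Caught error: ", conditionMessage(e))
)
```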
We will use the `lifeExp` column from the `my_gapminder` data to apply a t-test. First, we perform a two-sided t-test where our alternative hypothesis states that the true mean is not equal to 60.
```r
x <- my_gapminder$lifeExp  # vector of values
alternative <- "two.sided" # type of alternative hypothesis
mu <- 60                   # hypothesized mean
my_t.test(x, alternative, mu)
```
We can see from the printed list that the p-value is larger than our $\alpha$ value of 0.05, so we fail to reject the null hypothesis that the true mean is equal to 60.
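As a quick sanity check, assuming `my_t.test` implements the standard one-sample t-test, base R's `t.test()` should report a matching p-value:

```r
# base R equivalent of the call above; the p-value should match
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)$p.value
```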
Then, we perform our t-test with the alternative hypothesis that the true mean is less than 60.

```r
x <- my_gapminder$lifeExp # vector of values
alternative <- "less"     # type of alternative hypothesis
mu <- 60                  # hypothesized mean
my_t.test(x, alternative, mu)
```
From the returned list, we observe that our p-value is much smaller than $\alpha$, so we reject the null hypothesis that the true mean is equal to 60 in favor of the alternative that it is less than 60.
Finally, we perform our t-test with the alternative hypothesis that the true mean is greater than 60.

```r
x <- my_gapminder$lifeExp # vector of values
alternative <- "greater"  # type of alternative hypothesis
mu <- 60                  # hypothesized mean
my_t.test(x, alternative, mu)
```
Since our p-value is far larger than an $\alpha$ value of 0.05, we fail to reject the null hypothesis that the true mean is equal to 60 against the alternative that it is greater than 60.
`my_lm` performs linear regression given a formula and a data frame. A call follows this format: `my_lm(dependent variable ~ independent variable 1 + independent variable 2 + ..., data)`.
Here, we will apply regression to the `lifeExp` variable from the `my_gapminder` data, using `gdpPercap` and `continent` as our predictors.
```r
# fit the linear model ("mod" rather than "lm", to avoid shadowing stats::lm)
mod <- my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)
mod # print out the model table
```
From the linear model table, we can observe the estimated coefficients and their associated statistics. The `gdpPercap` coefficient represents the change in `lifeExp` per unit increase in `gdpPercap`, holding `continent` fixed. A p-value is also computed for each coefficient, so each variable has an associated hypothesis test. The null hypothesis for the `gdpPercap` variable is that there is no relationship between `gdpPercap` and `lifeExp`. Since our p-value for `gdpPercap` is smaller than an $\alpha$ value of 0.05, we can reject that null hypothesis. This means that adding `gdpPercap` to this model is meaningful when predicting `lifeExp`, and that there is some meaningful relationship between `gdpPercap` and `lifeExp`.
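As a sanity check, assuming `my_lm` fits ordinary least squares, its coefficient estimates should match those from base R's `lm()`:

```r
# base R equivalent; these should match the first column of the my_lm table
coef(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))
```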
We can take this model and compute the fitted values. Then, to better understand how well the model did, we can plot the fitted values against the actual values.
```r
x <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder) # model matrix
est <- mod[, 1]     # coefficients from the linear model
fitted <- x %*% est # fitted values

# put the fitted and actual values into one data frame
fit_act <- data.frame(cbind(fitted, my_gapminder$lifeExp))
colnames(fit_act) <- c("fitted", "actual") # rename columns

# plot the fitted and actual values on a scatterplot
ggplot(fit_act) +
  geom_point(aes(fitted, actual)) +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2)
```
Since the points do not follow the 45-degree line (the red dashed line), our model does not seem to predict very well. In fact, the points are clustered by continent, as we can see gaps between the groups of points. Further, within these groups, the points run vertically rather than along the dashed red line, meaning we have fairly large positive and negative residuals. However, the model does seem to predict the cluster in the 65-80 range fairly well, as those points sit near the red dashed line and trend upward with it. In other words, the fitted values were very similar to the actual values in that range.
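To put a number on that vertical spread, we can summarize the residuals directly from the `fit_act` data frame built above:

```r
fit_act$resid <- fit_act$actual - fit_act$fitted # residual = actual - fitted
summary(fit_act$resid) # a wide range reflects the large residuals seen in the plot
```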
`my_knn_cv` performs k-nearest-neighbors cross-validation given a data frame of covariates, a vector of true class values, the number of neighbors, and the number of folds. A call follows this format: `my_knn_cv(data used to predict a column, column to predict, number of neighbors, number of folds)`.
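For instance, a single call might look like the sketch below (assuming, as in the loop that follows, that columns 3 through 6 of `my_penguins` hold the numeric covariates and that the returned list contains a `cv_error` element):

```r
peng <- na.omit(mypackage::my_penguins) # kNN cannot handle missing values
single_run <- my_knn_cv(peng[, 3:6], peng$species, k_nn = 5, k_cv = 5)
single_run$cv_error # estimated misclassification rate from 5-fold CV
```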
Here, we use the `my_penguins` dataset to predict our output class, `species`, with 5-fold cross-validation, recording the cross-validation and training errors for 1 to 10 nearest neighbors. To do this, we use a for loop to call our `my_knn_cv` function 10 times. But first, make sure to omit any `NA` observations!
```r
my_penguins <- na.omit(mypackage::my_penguins) # load data and omit NAs

cv_error <- c()    # initialize CV error vector
train_error <- c() # initialize training error vector

# loop over 1 to 10 neighbors
for (k_nn in 1:10) {
  # compute the CV error for k_nn neighbors
  cv_error[k_nn] <- my_knn_cv(my_penguins[, 3:6], my_penguins$species,
                              k_nn = k_nn, k_cv = 5)$cv_error
  # compute the training error for k_nn neighbors
  train_error[k_nn] <- mean(knn(my_penguins[, 3:6], my_penguins[, 3:6],
                                my_penguins$species, k_nn) != my_penguins$species)
}

# store the results in one data frame
error_df_knn <- data.frame(
  k_nn = 1:10,
  cv_error = cv_error,
  train_error = train_error
)
error_df_knn # print the data frame
```
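Since `error_df_knn` is already in a tidy format, we can also plot both error curves with the `ggplot2` package loaded earlier; a small sketch:

```r
# visualize CV and training error against the number of neighbors
ggplot(error_df_knn, aes(x = k_nn)) +
  geom_line(aes(y = cv_error, col = "CV error")) +
  geom_line(aes(y = train_error, col = "Training error")) +
  labs(x = "Number of neighbors (k_nn)", y = "Misclassification error", col = "")
```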
We use cross-validation to randomly split the data into folds that we can compare against each other; this way, we can gauge the performance of our models by analyzing their out-of-sample errors. In particular, cross-validation helps minimize overfitting, since each model is evaluated on data it was not trained on. Based on the CV error, I would choose k_nn = 1, since it yields the lowest error. However, based on training error, I would choose k_nn = 2, since choosing k_nn = 1 by training error is meaningless: the training error is always 0 when k_nn = 1, because each point is its own nearest neighbor. Thus, picking the next k_nn with the lowest error is the wiser choice. Overall, we would choose k_nn = 2, since it minimizes our risk of both underfitting and overfitting.
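To select these values programmatically rather than by eye, `which.min()` does the trick (dropping the degenerate `k_nn = 1` row when using the training error, for the reason above):

```r
error_df_knn$k_nn[which.min(error_df_knn$cv_error)] # best k_nn by CV error
# best k_nn by training error, excluding k_nn = 1
with(error_df_knn[-1, ], k_nn[which.min(train_error)])
```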
`my_rf_cv` performs random forest cross-validation on the `my_penguins` dataset, given a number of folds `k`, to predict the species of the penguins. It returns the mean squared error for the given number of folds.
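A single run is one line; note that because folds are assigned randomly, repeated calls return slightly different errors, which is why we repeat each setting 30 times below:

```r
my_rf_cv(5) # one 5-fold cross-validation run; results vary from run to run
```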
```r
k_cv <- c()     # initialize vector of k-fold values
cv_error <- c() # initialize CV error vector

# loop through k = 2, 5, and 10 folds
for (k in c(2, 5, 10)) {
  # repeat 30 times for each k
  for (i in 1:30) {
    cv_error <- append(cv_error, my_rf_cv(k)) # add recorded error to vector
    k_cv <- append(k_cv, k)                   # add k-fold to vector
  }
}

# put the results into one data frame
error_df_rf <- data.frame(
  k = k_cv,
  cv_error = as.numeric(cv_error)
)

# show a boxplot of the errors for each k
ggplot(error_df_rf) +
  geom_boxplot(aes(x = cv_error, y = as.factor(k))) +
  labs(x = "CV error", y = "k")

# a table of the mean and standard deviation of errors for each k
error_df_rf %>%
  group_by(k) %>%
  summarize(mean = mean(cv_error), sd = sd(cv_error))
```
We can see that the variance of the CV error decreases as we increase the value of k. This shows up in the decreasing standard deviations and the narrower boxplots. We see this difference because, as we use more folds, each model is trained on a larger share of the data, so the fitted models, and therefore their estimated errors, vary less from run to run.