```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

```r
library(STAT302Package)
library(ggplot2)
library(dplyr)
library(kableExtra)
set.seed(617)
```
`STAT302Package` was developed as part of Project 3, the final project for STAT 302 (Statistical Computing), taught by Bryan Martin, PhD, at the University of Washington.
`STAT302Package` is installed from GitHub using:

```r
devtools::install_github("smokingcrater/STAT302Package")
```
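If the devtools package is not already available, it can be installed from CRAN first:

```r
# Install devtools from CRAN if it is not already available.
install.packages("devtools")
```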
Once installed, `STAT302Package` is loaded as follows:

```r
library(STAT302Package)
```
`STAT302Package` comprises four R functions: `my_t.test()`, `my_lm()`, `my_knn_cv()`, and `my_rf_cv()`. Each function is discussed below.
NOTE: Copies of the `gapminder` (from the gapminder package) and `penguins` (from the palmerpenguins package) data sets have been incorporated into `STAT302Package` and are named `my_gapminder` and `my_penguins`.
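A quick way to confirm that both bundled data sets are available once the package is loaded:

```r
# Dimensions of the bundled copies of the gapminder and penguins data sets.
dim(my_gapminder)
dim(my_penguins)
```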
my_t.test()

`my_t.test()` performs a one-sample t-test in R and offers the same essential functionality as `t.test()` (from the stats package).
To demonstrate `my_t.test()`, we'll analyze life expectancy (`lifeExp`) from the `my_gapminder` data set. Prior to conducting hypothesis tests, let's review the life expectancy data.

```r
summary(my_gapminder$lifeExp)
```
NOTE: In the following three hypothesis tests, the significance level (the p-value cutoff) is $\alpha = 0.05$.
Hypothesis test 1 (two-sided t-test)
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &\neq 60.
\end{align}
The following p-value is greater than $\alpha$, indicating weak evidence against the null hypothesis. Therefore, we fail to reject the null hypothesis.

```r
my_t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)
```
Hypothesis test 2 (one-sided t-test)
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &< 60.
\end{align}
The following p-value is just below $\alpha$, providing sufficient evidence against the null hypothesis at the 0.05 level. Therefore, we reject the null hypothesis.

```r
my_t.test(my_gapminder$lifeExp, alternative = "less", mu = 60)
```
Hypothesis test 3 (one-sided t-test)
\begin{align}
H_0: \mu &= 60,\\
H_a: \mu &> 60.
\end{align}
The following p-value is far greater than $\alpha$, indicating weak evidence against the null hypothesis. Therefore, we fail to reject the null hypothesis.

```r
my_t.test(my_gapminder$lifeExp, alternative = "greater", mu = 60)
```
my_lm()

`my_lm()` fits a linear model in R and offers the same essential functionality as `lm()` (from the stats package).
Linear regression is performed below on the `my_gapminder` data set using life expectancy (`lifeExp`; see `my_t.test()` above for a review of `lifeExp`) as the response variable (y), and per capita GDP (`gdpPercap`) and continent (`continent`) as the explanatory variables (x).

```r
my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
```
`my_lm()` returns a summary table for the intercept and the explanatory variables. Looking at the `gdpPercap` row of this table, we see that the p-value of the two-sided hypothesis test ($H_0$: the `gdpPercap` coefficient equals 0) is virtually zero. This provides strong evidence against the null hypothesis, suggesting that per capita GDP is significantly associated with life expectancy after accounting for continent.
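As a sanity check (again, not part of the package itself), the same table can be reproduced with `lm()` from the stats package; the estimates, standard errors, t-statistics, and p-values should match:

```r
# Cross-check my_lm() against the coefficient table from stats::lm().
summary(lm(lifeExp ~ gdpPercap + continent, data = my_gapminder))$coefficients
```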
The following scatterplot illustrates the relationship between the actual and fitted values and includes a dashed y = x reference line. The plot suggests an inconclusive model fit: the model fits well in some cases (e.g., Europe, where both the predicted and the actual life expectancies lie roughly between the low 70s and low 80s) but poorly in others (e.g., Africa, where the predicted values range from the high 40s to the high 50s while the actual values range from the mid 30s to the mid 70s).
```r
# To obtain fitted values, compute X %*% beta, where:
#   X is the design matrix returned by model.matrix(), and
#   beta is the column of estimates returned by my_lm().
fitted_life_exp <- model.matrix(lifeExp ~ gdpPercap + continent,
                                data = my_gapminder) %*%
  my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)[, 1]
# Create a data frame of the actual and fitted values along with their
# associated continent.
actual_vs_fit_df <- data.frame(actual = my_gapminder$lifeExp,
                               fitted = fitted_life_exp,
                               Continent = my_gapminder$continent)
# Plot actual vs. fitted values and use color to distinguish continents.
avf <- ggplot(actual_vs_fit_df, aes(x = fitted, y = actual, col = Continent)) +
  geom_point(size = 0.75) +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 10) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
avf
```
my_knn_cv()

`my_knn_cv()` applies k-fold cross-validation to a nearest neighbor algorithm (`knn()` from the class package).
The following example predicts the penguin species (`species`) (classification = y) using four characteristics (covariates = x): bill length (`bill_length_mm`), bill depth (`bill_depth_mm`), flipper length (`flipper_length_mm`), and body mass (`body_mass_g`). 5-fold cross-validation is applied to the nearest neighbor algorithm 10 times, starting with one neighbor and successively incrementing to ten neighbors.
```r
# Subset columns (classification = y = species; covariates = x =
# bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g)
# and eliminate NAs.
clean_penguins <- my_penguins %>%
  select(species, bill_length_mm, bill_depth_mm,
         flipper_length_mm, body_mass_g) %>%
  na.omit()
# Covariates (x data).
my_train_nn <- clean_penguins %>% select(-species)
# Classification (y data).
my_cl <- clean_penguins$species
# Maximum of 10 nearest neighbors.
k <- 10
# Data frame to gather results for the table.
synopsis <- data.frame(
  "nn" = rep(NA, times = 10),
  "cv_error" = rep(NA, times = 10),
  "training_error" = rep(NA, times = 10)
)
# Successively apply the nearest neighbor algorithm using cross-validation
# for 1 to k (in this case, 10) neighbors.
for (i in 1:k) {
  result <- my_knn_cv(train = my_train_nn, cl = my_cl, k_nn = i, k_cv = 5)
  synopsis[i, 1] <- i
  synopsis[i, 2] <- result$cv_err
  synopsis[i, 3] <- mean(my_cl != result$class)
}
```
Both the cross-validation misclassification rates and the training misclassification rates (no cross-validation) for the successive calls (1 to 10 neighbors) are documented in the following table:
```r
colnames(synopsis) <- c("# of Nearest Neighbors",
                        "Cross-Validation Misclassification Error",
                        "Training Set Misclassification Error")
mes <- kable_styling(kable(synopsis))
mes
```
In theory, based solely on the above results, I'd choose the model with 1 neighbor according to the cross-validation misclassification rates, since it has the lowest cross-validation misclassification error. Likewise, I'd choose the model with 1 neighbor according to the training misclassification rates, since its training misclassification error is 0.
As discussed in the STAT 302 Statistical Prediction lectures, it's fairly straightforward to fit a model with low variance and high bias, or with high variance and low bias; the latter, in particular, run the risk of being overfitted to the training data.
Cross-validation is a simple concept in which the training data set is split into multiple (k) parts (folds). All but one of the folds are used to train the model while the remaining fold is used to make predictions. This process is repeated until every fold has been used both to train and to predict. By employing cross-validation, we're able to build a model utilizing the full data set while still evaluating the model's performance on out-of-sample data. It's important to remember that cross-validation is used to compare and choose between multiple predictive models by analyzing their cross-validation misclassification errors.
So, in practice, I'd likely use cross-validation with a k value of 5-10 to evaluate candidate models and choose the predictive model with the lowest cross-validation misclassification error.
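To make the procedure concrete, here is a minimal sketch of the k-fold logic described above, applied to `knn()` from the class package. The function name `cv_knn_sketch` is hypothetical and the random fold assignment is an assumption (`my_knn_cv()` may assign folds differently); it reuses the `my_train_nn` and `my_cl` objects defined earlier.

```r
# A minimal k-fold cross-validation sketch for class::knn().
# Illustrative only; my_knn_cv() may implement the details differently.
cv_knn_sketch <- function(train, cl, k_nn, k_cv) {
  # Randomly assign each observation to one of k_cv folds (assumption).
  fold <- sample(rep(1:k_cv, length.out = nrow(train)))
  errs <- numeric(k_cv)
  for (j in 1:k_cv) {
    # Train on all folds except fold j; predict on the held-out fold j.
    pred <- class::knn(train = train[fold != j, ],
                       test = train[fold == j, ],
                       cl = cl[fold != j],
                       k = k_nn)
    # Misclassification rate on the held-out fold.
    errs[j] <- mean(pred != cl[fold == j])
  }
  # Average misclassification rate across all held-out folds.
  mean(errs)
}
# Example: 5-fold CV error for a single nearest neighbor.
cv_knn_sketch(my_train_nn, my_cl, k_nn = 1, k_cv = 5)
```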
my_rf_cv()

`my_rf_cv()` applies k-fold cross-validation to a random forest algorithm (`randomForest()` from the randomForest package). This function is hard-wired to work solely on the `my_penguins` data set; the only meaningful parameter is `k`, which represents the desired number of cross-validation folds (an additional `feedback` parameter exists solely for debugging).
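For intuition, here is a minimal sketch of the kind of computation `my_rf_cv()` plausibly performs internally. The function name `rf_cv_sketch`, the random fold assignment, and the `ntree = 100` setting are assumptions for illustration; the actual implementation may differ.

```r
# Illustrative k-fold CV MSE for randomForest() on my_penguins.
# Details (fold assignment, ntree) are assumptions, not the package's code.
rf_cv_sketch <- function(k) {
  dat <- as.data.frame(na.omit(
    my_penguins[, c("body_mass_g", "bill_length_mm",
                    "bill_depth_mm", "flipper_length_mm")]))
  # Randomly assign each observation to one of k folds (assumption).
  fold <- sample(rep(1:k, length.out = nrow(dat)))
  mse <- numeric(k)
  for (j in 1:k) {
    # Fit a random forest on all folds except fold j.
    model <- randomForest::randomForest(
      body_mass_g ~ bill_length_mm + bill_depth_mm + flipper_length_mm,
      data = dat[fold != j, ], ntree = 100)
    # Predict body mass on the held-out fold and record its MSE.
    pred <- predict(model, dat[fold == j, ])
    mse[j] <- mean((pred - dat$body_mass_g[fold == j])^2)
  }
  # Average MSE across all held-out folds.
  mean(mse)
}
```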
Below we predict body mass (`body_mass_g`) (response = y) using three characteristics (covariates = x): bill length (`bill_length_mm`), bill depth (`bill_depth_mm`), and flipper length (`flipper_length_mm`). Cross-validation is applied to the random forest algorithm using fold counts of 2, 5, and 10, with 30 simulations per fold count.
```r
# Cross-validation fold counts.
folds <- c(2, 5, 10)
# Number of simulations to run for each fold count.
iterations <- 30
# Matrix to gather results for the table.
results <- matrix(nrow = iterations, ncol = length(folds))
# Successively apply the random forest algorithm using cross-validation
# for each fold count specified.
for (k in 1:length(folds)) {
  for (iteration in 1:iterations) {
    results[iteration, k] <- my_rf_cv(folds[k])
  }
}
```
The resulting cross-validation estimated mean squared errors (MSEs) for each fold count (`k`) are depicted in the following boxplots (outliers shown as solid grey circles):
```r
# Convert from matrix to data frame for ggplot2.
mse_df <- data.frame(k = as.factor(rep(folds, each = iterations)),
                     mse = c(results[, 1], results[, 2], results[, 3]))
# Generate a boxplot for each fold count.
cvmse <- ggplot(data = mse_df, aes(x = k, y = mse, color = k)) +
  geom_boxplot(outlier.colour = "grey68", outlier.shape = 16,
               outlier.size = 2, notch = TRUE) +
  geom_jitter(shape = 2, position = position_jitter(0.2)) +
  scale_color_manual(values = c("darkorange1", "seagreen1", "royalblue1")) +
  theme_bw(base_size = 10) +
  labs(x = "k (number of folds)", y = "CV estimated MSE",
       title = "Cross Validation MSEs") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
cvmse
```
The resulting mean and standard deviation of the cross-validation estimated MSEs for each fold count (`k`) are presented below:
```r
# Convert from matrix to data frame for the kable table.
synopsis <- data.frame(
  "average" = rep(NA, times = ncol(results)),
  "stdev" = rep(NA, times = ncol(results))
)
rn <- rep(NA, times = ncol(results))
# Compute the mean and standard deviation for each fold count.
for (i in 1:length(folds)) {
  synopsis[i, 1] <- mean(results[, i])
  synopsis[i, 2] <- sd(results[, i])
  rn[i] <- paste("k (number of folds) = ", folds[i])
}
# Label the table.
rownames(synopsis) <- rn
colnames(synopsis) <- c("Mean CV Estimate", "Standard Deviation CV Estimate")
cves <- kable_styling(kable(synopsis))
cves
```
The above boxplots depict the cross-validation estimated MSEs visually, while the above table presents the same data numerically. Both suggest that as the number of cross-validation folds increases, the mean and the standard deviation of the cross-validation MSEs decrease, which is what I'd expect: with more folds, each model is trained on a larger share of the data, yielding more stable out-of-sample predictions.