```{=html}

<!--- End styling code. --->

```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(StatsTools)

Introduction

StatsTools is a package of statistics functions written as part of the STAT302 class at the University of Washington. It includes a t-test, a linear model, an implementation and cross-validation of the k-nearest neighbor algorithm, and cross-validation of a random forest algorithm applied to the my_penguins data.

You can install from Github and load StatsTools with:

# install.packages("devtools")
devtools::install_github("marcradke/StatsTools")
library(StatsTools)

my_t.test

To demonstrate the t-test function, we will use the lifeExp data from gapminder.

First, we will test the hypothesis \begin{align} H_0: \mu &= 60,\ H_a: \mu &\neq 60. \end{align}

# Load my_gapminder data
data(my_gapminder)
# Call function with default alternative setting
my_t.test(my_gapminder$lifeExp, mu = 60)

Using a p-value cut-off of $\alpha = 0.05$, we fail to reject the null hypothesis that the true mean life expectancy is 60 and find no evidence to support the alternative hypothesis that the true mean life expectancy is not equal to 60.

Next, we will test the hypothesis \begin{align} H_0: \mu &= 60,\ H_a: \mu &< 60. \end{align}

my_t.test(my_gapminder$lifeExp, alternative = "less", mu = 60)

Again using $\alpha = 0.05$, here we are able to reject the null hypothesis and find that the data support our alternative hypothesis: that the true life expectancy is less than 60.

Finally, we will test the hypothesis \begin{align} H_0: \mu &= 60,\ H_a: \mu &> 60. \end{align}

my_t.test(my_gapminder$lifeExp, alternative = "greater", mu = 60)

With $\alpha = 0.05$, we fail to reject our null hypothesis and find no evidence to support the alternative hypothesis that the true mean life expectancy is greater than 60.

my_lm

To demonstrate the linear model function, we will again use data from my_gapminder and fit a linear model of the form lifeExp ~ gdpPercap + continent.

# Fit linear model and print output
gap_lm <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
kableExtra::kable_styling(kableExtra::kable(gap_lm, digits = 150))

An estimate of 0.00045 for the coefficient of gdpPercap means that, all other covariate being equal, a change of one unit in gdpPercap would change lifeExp by 0.00045. The table also gives us output of a hypothesis test with null hypothesis that the coefficient of gdpPercap is equal to zero and the alternative hypothesis that it is not equal to zero. Using a p-value cutoff of $\alpha = 0.05$, we reject the null hypothesis and find evidence that the coefficient is not zero.

We can plot the actual and fitted values of gdpPercap to get a sense of our model fit:

library(ggplot2)
# Get actual values
actual <- my_gapminder$lifeExp
# Extract model matrix X from linear model
X <- stats::model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
# Create vector of fitted values (Y-hat) by matrix-multiplying X by estimates
# of the coefficients (beta)
fitted <- X %*% as.matrix(gap_lm$Estimate)
# Combine into dataframe
avsf <- data.frame("fitted" = fitted,
                   "actual" = actual)
# Plot Actual vs. Fitted
avsfPlot <- ggplot(data = avsf, aes(y = actual, x = fitted)) +
  geom_point(size = .9, color = "#ff6961") +
  geom_abline(slope = 1, intercept = 0, lty = 2, color = "white") + 
  labs(title = "Actual vs. Fitted",
       y = "Actual",
       x = "Fitted") +
  ylim(20, 85) +
  xlim(20,110) +
  theme_dark() +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.margin = margin(0.5, 1, 0.5, 0.5, "cm"))
avsfPlot

For a well-fit model, we would expect points around the line y = x. Our model appears to be a decent fit for values above ~ 70, but very poor otherwise.

my_knn_cv

To demonstrate the k-nearest neighbors cross-validation function, we will use data from my_penguins and predict species using bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. We will use 5-fold cross-validation and iterate from 1 to 10 k-nearest neighbors.

# Load the my_penguins data
data(my_penguins)
# Remove entries missing data
penguins <- my_penguins %>% tidyr::drop_na()
# Create training data
train <- penguins[,3:6]
# Create true class vector
cl <- penguins[,1]
# Create empty results dataframe
outpt <- data.frame("k_nn" = 1:10,
                    "err_train" = rep(NA, 10),
                    "err_cv" = rep(NA, 10))
# Clean column display names
colnames(outpt) <- c("k_nn", "Training Error", "CV Error")
# Iterate through k_nn = 1:10
for (i in 1:10) {
  pred <- my_knn_cv(train, cl, k_nn = i, k_cv = 5)
  # Record the training error
  outpt[i,2] <- mean((1-(penguins$species == pred[[1]]))^2)
  # Record the cross-validation error
  outpt[i,3] <- pred[[2]]
}
# Output table
kableExtra::kable_styling(kableExtra::kable(outpt))

Both the training and CV mis-classification error are minimized where k_nn = 1, and based solely on these metrics, a model where k_nn = 1 would be chosen. However, such a model would predict the class of test data based solely on their proximity to single datapoints in the training data, rather than trying to detect some trend in the sample space. In practice, a model where k_nn = 9 would be the best of these 10 because it has the next-lowest CV mis-classification error.

my_rf_cv

To demonstrate the random forest cross-validation function, we will again use the my_penguins data. We will predict body_mass_g using bill_length_mm, bill_depth_mm, and flipper_length_mm and iterate through k = 2, k = 5, and k = 10 (where k is the number of folds in our cross-validation) 30 times each to analyze the spread of the average MSE (mean squared error) .

reps <- 30
k_vals <- c(2, 5, 10)
# Create empty output dataframe
out <- data.frame("k" = rep(NA, reps * length(k_vals)),
                  "mse" = rep(NA, reps * length(k_vals)))

# Iterate through each k and each repetition and record the MSE
start <- 0
for (k in k_vals) {
  for (i in 1:reps) {
    out[start + i, 1] <- paste0("k_",k)
    out[start + i, 2] <- my_rf_cv(k)
  }
start <- start + reps
}

Next, we'll plot the MSE for each k and summarize the data:

# Make k column a factor and relevel
out$k <- factor(out$k, levels = c("k_2", "k_5", "k_10"))

# Create boxplots for each value of k
rf_plot <- ggplot(data = out, aes(y = mse, x = k)) +
  geom_boxplot(fill = "#ff6961") +
  scale_x_discrete(labels = c(2, 5, 10)) +
  labs(title = "Mean Squared Error For Each k",
       y = "MSE") +
  theme_dark() +
  theme(plot.title = element_text(hjust = 0.5), 
        plot.margin = margin(0.5, 1, 0.5, 0.5, "cm"))
rf_plot
# Group by k and summarize: mean and sd
summary <- out %>%
  # Change k value strings to numbers so table looks nicer
  dplyr::mutate(k = rep(c(2,5,10), each = 30)) %>%
  dplyr::group_by(k) %>%
  dplyr::summarize(mse_mean = mean(mse),
            mse_sd = sd(mse))
# assign colnames for printing
colnames(summary) <- c("k", "Mean MSE", "SD MSE")
# Format and print table
kableExtra::kable_styling(kableExtra::kable(summary))

As k increases, the mean and standard deviation of the MSE tends to decrease, possibly because more folds means that more training data is used to fit the model each time.



marcradke/StatsTools documentation built on Dec. 21, 2021, 1:51 p.m.