Introduction

This package was designed for my UW STAT 302 course's Project 3. It contains four methods for performing statistical inference/prediction: \begin{itemize} \item Linear Regression \item T-Test \item K-Nearest Neighbors with Cross-Validation \item Random Forests with Cross Validation \end{itemize}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

To install this package through Github, use

devtools::install_github("CamSims/myfirstpackage")

To begin, we load the package.

library(myfirstpackage)

my_t.test Tutorial

This section will demonstrate how to use the my_t.test() function.

# Use the lifeExp data from my_gapminder
dat <- my_gapminder$lifeExp

# Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu != 60
my_t.test(x = dat, alternative = "two.sided", mu = 60)

At a cutoff of $\alpha = 0.05$, the p-value of $0.0932$ means the null hypothesis can not be rejected. In other words, $60$ is a valid value for $\mu$.

# Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu < 60
my_t.test(x = dat, alternative = "less", mu = 60)

Using the $\alpha = 0.05$ cutoff, the p-value of $0.0466$ means the null hypothesis can be rejected. In other words, the data suggests $\mu$ is less than $60$.

# Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu > 60
my_t.test(x = dat, alternative = "greater", mu = 60)

Using the $\alpha = 0.05$ cutoff, the p-value of $.9533$ means the null hypothesis can not be rejected. The data suggests $mu = 60$ is a valid value.

my_lm Tutorial

This section will demonstrate how to use the my_lm() function.

# Performing linear regression on lifeExp using gdpPercap and continent
model <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)

# Show the model
model

The gdpPercap coefficient indicates that a unit increase results in an increase of roughly .0004 in life expectancy. Let $\beta_{1}$ be the coefficient. The hypothesis test associated with this is then $H_{0}: \beta_{1} = 0$, $H_{a}:\beta_{1} \neq 0$. The associated p-value is very close to 0, so a cut-off of $\alpha = 0.05$ results in rejection of the null hypothesis. In other words, the data suggests $\beta_{1}$ is not $0$.

# Calculate beta for fitted values
beta <- model$Estimate

# Calculate x for fitted values
x <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)

# Calculate fitted values
fitted <- x %*% beta

# Plot Actual vs Fitted Values
ggplot2::ggplot(mapping = ggplot2::aes(x = fitted, y = my_gapminder$lifeExp)) + 
  ggplot2::geom_line() +
  ggplot2::labs(title = "Actual Vs Fitted Values") +
  ggplot2::xlab("Fitted Value") +
  ggplot2::ylab("Actual Value")

From the plot, it appears the fitted values don't match the actual values very well. This means the model fit isn't very good.

my_knn_cv Tutorial

# Remove NA values and select training columns
train_dat <- na.omit(my_penguins)[c("bill_length_mm", "bill_depth_mm",
                  "flipper_length_mm", "body_mass_g")]

# Omit NA values and select classifying column
correct <- data.frame("species" = na.omit(my_penguins)$species)

# Arrays to hold results
training_errors <- c()
cv_misclass <- c()

# Perform k nearest_neighbors
for (k in 1:10) {
  results <- my_knn_cv(train_dat, correct, k_nn = k, k_cv = 5)
  cv_misclass[k] <- results$cv_err
  training_errors[k] <- mean(results$class != correct$species)
}

# Find model with least training error
least_train <- which.min(training_errors)

# Find model with least cv misclassification rate
least_cv_misclass <- which.min(cv_misclass)

The model to choose based on the training misclassification rates is the one using r least_train neighbors. The model to choose based on CV misclassification rates is the one that uses r least_cv_misclass neighbors. In practice, you would choose the model that uses r least_cv_misclass neighbors. CV (cross-validation) means the method is splitting the data into a given number of folds, in this case 5, then going one fold at a time. It uses the current fold as its testing data and all other folds training data. This allows the method to evaluate the performance of the covariates on data that was not used to train it. The model chosen to use in practice has the lowest CV misclassification rate, which means it can better predict new data that is added to it.

my_rf_cv Tutorial

# Vector to hold CV estimates MSEs
mses <- c()

# Using k = c(2, 5, 10), perform random forest 30 times each
for (k in c(2, 5, 10)) {
  for (i in 1:30) {
    mses <- append(mses, my_rf_cv(k))
  }
}

# Separate MSEs by k
dat <- data.frame(mse = mses, k = rep(c(2, 5, 10), each = 30))
# Create a boxplot for each value of k
ggplot2::ggplot(data = dat, mapping = ggplot2::aes(x = as.factor(k), y = mse)) + 
  ggplot2::geom_boxplot() +
  ggplot2::labs(title = "Distribution Of MSE For Different Numbers Of Folds") +
  ggplot2::xlab("Number of Folds") +
  ggplot2::ylab("CV-Estimated MSE")

# Calculate average CV estimate and standard deviation for k = 2
two_avg <- mean(subset(dat, k == 2)$mse)
two_sd <- sd(subset(dat, k == 2)$mse)

# Calculate average CV estimate and standard deviation for k == 5
five_avg <- mean(subset(dat, k == 5)$mse)
five_sd <- sd(subset(dat, k == 5)$mse)

# Calculate average CV estimate and standard deviation for k == 10
ten_avg <- mean(subset(dat, k == 10)$mse)
ten_sd <- sd(subset(dat, k == 10)$mse)

# Display the average CV estimates and standard deviations in a table
data.frame(Mean_CV = c(two_avg, five_avg, ten_avg),
           SD_CV = c(two_sd, five_sd, ten_sd))

From the boxplots, it appears the median CV-Estimated MSE decreases as the number of folds increases. 5 folds has the smallest interquartile range of the three fold amounts and exists almost entirely within the range of the 10-fold boxplot. The 2-fold boxplot has the largest range and is centered higher than the other boxplots. Numerically, the table shows 10 folds has the lowest average CV estimate whereas five folds has the lowest CV standard deviation. Two folds has both the highest mean CV estimate and the highest CV standard deviation.

Generally, test error (CV MSE) will decrease as the number of folds increases. This is because more data is being used to calculate the predicted value(s). However, after a certain point, this value can decrease as the model becomes less reliable (training error increases). As such, it is possible the best number of folds is close to five, or between five and ten. While ten folds might have the lowest mean CV-Estimated MSE, it has higher standard deviation than five-folds and thus is not conclusively better.



CamSims/myfirstpackage documentation built on Dec. 17, 2021, 1:56 p.m.