This package was designed for my UW STAT 302 course's Project 3. It contains four methods for performing statistical inference/prediction: \begin{itemize} \item Linear Regression \item T-Test \item K-Nearest Neighbors with Cross-Validation \item Random Forests with Cross Validation \end{itemize}
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
To install this package through Github, use
devtools::install_github("CamSims/myfirstpackage")
To begin, we load the package.
library(myfirstpackage)
This section will demonstrate how to use the my_t.test() function.
# Use the lifeExp data from my_gapminder dat <- my_gapminder$lifeExp # Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu != 60 my_t.test(x = dat, alternative = "two.sided", mu = 60)
At a cutoff of $\alpha = 0.05$, the p-value of $0.0932$ means the null hypothesis can not be rejected. In other words, $60$ is a valid value for $\mu$.
# Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu < 60 my_t.test(x = dat, alternative = "less", mu = 60)
Using the $\alpha = 0.05$ cutoff, the p-value of $0.0466$ means the null hypothesis can be rejected. In other words, the data suggests $\mu$ is less than $60$.
# Demonstrate a test of the hypothesis H0: mu = 60, Ha: mu > 60 my_t.test(x = dat, alternative = "greater", mu = 60)
Using the $\alpha = 0.05$ cutoff, the p-value of $.9533$ means the null hypothesis can not be rejected. The data suggests $mu = 60$ is a valid value.
This section will demonstrate how to use the my_lm() function.
# Performing linear regression on lifeExp using gdpPercap and continent model <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder) # Show the model model
The gdpPercap coefficient indicates that a unit increase results in an increase of roughly .0004 in life expectancy. Let $\beta_{1}$ be the coefficient. The hypothesis test associated with this is then $H_{0}: \beta_{1} = 0$, $H_{a}:\beta_{1} \neq 0$. The associated p-value is very close to 0, so a cut-off of $\alpha = 0.05$ results in rejection of the null hypothesis. In other words, the data suggests $\beta_{1}$ is not $0$.
# Calculate beta for fitted values beta <- model$Estimate # Calculate x for fitted values x <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) # Calculate fitted values fitted <- x %*% beta # Plot Actual vs Fitted Values ggplot2::ggplot(mapping = ggplot2::aes(x = fitted, y = my_gapminder$lifeExp)) + ggplot2::geom_line() + ggplot2::labs(title = "Actual Vs Fitted Values") + ggplot2::xlab("Fitted Value") + ggplot2::ylab("Actual Value")
From the plot, it appears the fitted values don't match the actual values very well. This means the model fit isn't very good.
# Remove NA values and select training columns train_dat <- na.omit(my_penguins)[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")] # Omit NA values and select classifying column correct <- data.frame("species" = na.omit(my_penguins)$species) # Arrays to hold results training_errors <- c() cv_misclass <- c() # Perform k nearest_neighbors for (k in 1:10) { results <- my_knn_cv(train_dat, correct, k_nn = k, k_cv = 5) cv_misclass[k] <- results$cv_err training_errors[k] <- mean(results$class != correct$species) } # Find model with least training error least_train <- which.min(training_errors) # Find model with least cv misclassification rate least_cv_misclass <- which.min(cv_misclass)
The model to choose based on the training misclassification rates is the one
using r least_train
neighbors. The model to choose based on CV
misclassification rates is the one that uses r least_cv_misclass
neighbors. In
practice, you would choose the model that uses r least_cv_misclass
neighbors.
CV (cross-validation) means the method is splitting the data into a given number
of folds, in this case 5, then going one fold at a time. It uses the current
fold as its testing data and all other folds training data. This allows the
method to evaluate the performance of the covariates on data that was not used
to train it. The model chosen to use in practice has the lowest CV
misclassification rate, which means it can better predict new data that is added
to it.
# Vector to hold CV estimates MSEs mses <- c() # Using k = c(2, 5, 10), perform random forest 30 times each for (k in c(2, 5, 10)) { for (i in 1:30) { mses <- append(mses, my_rf_cv(k)) } } # Separate MSEs by k dat <- data.frame(mse = mses, k = rep(c(2, 5, 10), each = 30))
# Create a boxplot for each value of k ggplot2::ggplot(data = dat, mapping = ggplot2::aes(x = as.factor(k), y = mse)) + ggplot2::geom_boxplot() + ggplot2::labs(title = "Distribution Of MSE For Different Numbers Of Folds") + ggplot2::xlab("Number of Folds") + ggplot2::ylab("CV-Estimated MSE") # Calculate average CV estimate and standard deviation for k = 2 two_avg <- mean(subset(dat, k == 2)$mse) two_sd <- sd(subset(dat, k == 2)$mse) # Calculate average CV estimate and standard deviation for k == 5 five_avg <- mean(subset(dat, k == 5)$mse) five_sd <- sd(subset(dat, k == 5)$mse) # Calculate average CV estimate and standard deviation for k == 10 ten_avg <- mean(subset(dat, k == 10)$mse) ten_sd <- sd(subset(dat, k == 10)$mse) # Display the average CV estimates and standard deviations in a table data.frame(Mean_CV = c(two_avg, five_avg, ten_avg), SD_CV = c(two_sd, five_sd, ten_sd))
From the boxplots, it appears the median CV-Estimated MSE decreases as the number of folds increases. 5 folds has the smallest interquartile range of the three fold amounts and exists almost entirely within the range of the 10-fold boxplot. The 2-fold boxplot has the largest range and is centered higher than the other boxplots. Numerically, the table shows 10 folds has the lowest average CV estimate whereas five folds has the lowest CV standard deviation. Two folds has both the highest mean CV estimate and the highest CV standard deviation.
Generally, test error (CV MSE) will decrease as the number of folds increases. This is because more data is being used to calculate the predicted value(s). However, after a certain point, this value can decrease as the model becomes less reliable (training error increases). As such, it is possible the best number of folds is close to five, or between five and ten. While ten folds might have the lowest mean CV-Estimated MSE, it has higher standard deviation than five-folds and thus is not conclusively better.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.