knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Loads packages and data frames from NewPackageLecture9
library(NewPackageLecture9)
library(dplyr)
library(ggplot2)
data("my_gapminder")
data("my_penguins")

To view the vignette, run lines 26 and 27 in console, then follow the following comments from lines 34 and 36 in the Vignette Install code chunk. Package "NewPackageLecture9" imports six mathematical functions and two supplementary data sets to use in those functions.

# Package instal instructions
devtools::install_github("danter2000/NewPackageLecture9", 
                         build_vignette = TRUE, 
                         build_opts = c())
# Calls the package
library(NewPackageLecture9)

# Use this to view the vignette in the Demo HTML help
help(package = "NewPackageLecture9", help_type = "html")
# Use this to view the vignette as an isolated HTML file
utils::browseVignettes(package = "NewPackageLecture9")

NewPackageLecture9 includes a t-test function named my_t.test. It can preform left, right, and two sided hypothesis tests. Here we will preform all three tests on a single column of data from the data frame my_gapminder with a null hypothesis equal to 60.

We will be creating a vector of the lifeExp observations called life_expectancy and using this vector in all three tests.

# From my_gapminder, pull and store the lifeExp column as a vector of numerics
life_expectancy <- my_gapminder %>% pull(lifeExp)

Two sided test using NewPackageLecture9's my_t.test.

# Two sided t.test with input vector of life expentancy and mu of 60
my_t.test(life_expectancy, alternative = "two.sided", mu = 60)

Since the p-value of .0932 for this two sided test is above our predetermined alpha of .05, we fail to reject the null hypothesis of Ho = 60, and say there is not enough evidence to conclude that the mean life expectancy from our data is statistically different from 60 years old.

Left sided test using NewPackageLecture9's my_t.test.

# Left sided t.test with input vector of life expectancy and mu of 60
my_t.test(life_expectancy, "less", 60)

Since the p-value of .0466 for this left sided test is slightly below our predetermined alpha of .05, we reject the null hypothesis of Ho = 60, and say there is evidence to conclude that the mean life expectancy from our data is statistically less than 60 years old.

Right sided test using NewPackageLecture9's my_t.test.

# Right sided t.test with input vector of life expectancy and mu of 60
my_t.test(life_expectancy, "greater", 60)

Since the p-value of .9533 for this right sided test is vastly above our predetermined alpha of .05, we fail to reject the null hypothesis of Ho = 60, and say there is no evidence to conclude that the mean life expectancy from our data is statistically greater than 60 years old.

This code chunk demonstrates the linear model function my_lm on data from my_gapminder. We will be creating a linear model using lifeExp as our response variable, and gdpPercap and continent as our explanatory variables.

# Output predictive measures from my_lm of life expectancy from gdpPercap and continent
output <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)

# Matrix of estimates from my_lm
beta <- output[,"Estimates"]

# Matrix that configures life expectancy from gdpPercap and continent
X <- model.matrix(lifeExp ~ gdpPercap + continent, my_gapminder)

# Fitted life expectancy values based on my_lm predictive model
fitted <- X %*% beta

# Data frame of fitted life expectancy, predicted life expectancy and corresponding continents.
tmp <- data.frame("Fitted" = fitted,  
                  "Actual" = my_gapminder$lifeExp, 
                  "Continent" = my_gapminder$continent)

# Calls the linear model
output


# Plot from tmp, x = actual life expectancy, y = predicted life expectancy based on my_lm 
ggplot(data = tmp, mapping = aes(x = Actual, y = Fitted, color = Continent)) + 

  # Scatter plot
  geom_point() +

  # add labels
  labs(title = "Life expectancy for different countries",
       x = "Actual life expectancy (years)",
       y = "Predicted life expectancy (years)") +

  # Centers and bolds title
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The p value for gdpPercap is to the negaitve 73rd power, meaning it is very small compared to any reasonable alpha level. This means that we would reject a null hypothesis and say gdpPercap is very associated with life expectancy.

Our gdpPercap coefficient is a numeric variable coefficient which represents the impact gpd per capita has on predicted life expectancy. From this model if we hold the impact of country constant, we see for every 1 currency unit increase of gdpPercap, there is a corresponding increase in life expectancy of approximately .000445 years. This would indicate it takes a high gdpPercap value to have a noticable difference on life expectancy.

This plot shows that Africa has the least variation in predicted life expectancy among the different countries, while every other continent has increasing variation in predicted life expectancy, especially as actual life expectancy increases. The predicted life expectancy years for Europe and Oceania seem to be very similar to the actual life expectancy years as measured in my_gapminder. I would say the model fit is very accurate for Europe and Oceania, and least accurate for Africa and the lower actual life expectancy countries in Asia and the Americas.

This code chunk uses k nearest neighbors prediction on data from my_penguins.

# Gets rid of NA values from penguins data
my_penguins <- na.omit(my_penguins)

# Creates data frame of just numeric values
pen_num <- my_penguins %>% select(bill_length_mm, 
                                  bill_depth_mm, 
                                  flipper_length_mm,
                                  body_mass_g)

# Creates vector of the species classifications
pen_cl <- my_penguins %>% pull(species)

# Makes vector of strings of those classes
pen_cl <- as.character(pen_cl)

# Initiallizes empty vector for cv error
cv_err <- rep(NA, 10)

# Initializes empty vector for training error
train_err <- rep(NA, 10)

# For the ith element in the vector from 1 to 10
for (i in 1:10) {

  # The output of my_knn_cv is repeated 10 times through the ith element
  output <- my_knn_cv(pen_num, pen_cl, i, 5)

  # cv error for the ith element is the cv error portion of the given output
  cv_err[i] <- (output$Error)

  # Training error is boolean of elements that equal actual pen_cl class compared to output class
  train_err[i] <- mean(pen_cl != output$class)


}

# The error table is a data frame of 10 of each type of error
err_table <- (data.frame(cv_err, train_err))

# Call the data frame table
err_table

For both the cv error and training error I noticed it is minimized at k = 1, so I would choose that k value as the model based purely on minimizing those errors. However, in practice choosing such a low k value will produce a very overfit model, albeit a very low-bias model, but it is still not flexible enough of a model to give us any meaningful predictive value once we use test error. In practice, we want to use the cv_error because the training error is error produced from the same data we used to train the model in the first place. It would not make sense to use the ungeneralizable training error in practice with new test data.

Cross validation splits data into parts, using all but one of those parts as "training data" to predict a fit the single part which we refer to as "testing data". We alternate which parts we will use as training data and which parts we will use as testing data until all parts are either one or the other, and this helps us predict future data that was not initially used to train our model.

This code chunk uses random forest prediction on data from my_penguins.

# Intializes empty matrix with 90 rows and 2 columns
cv_est_matr <- matrix(NA, nrow = 90, ncol = 2)

# Initializes vector where 2 repeats 30 times, then 5, 30 times, etc
evals <- rep(c(2,5,10), each = 30)

# For the jth element in the vector from 1 to 90
for (j in 1:90) {

  # The 1st column is the cv error for the jth element in the evals vector
  cv_est_matr[j, 1] <- my_rf_cv(evals[j])$Error

  # The 2nd column is the eval vector value that corresponds to that cv error
  cv_est_matr[j, 2] <- evals[j]

  }

# Converts 90X2 matrix to data frame
cv_est_frame <- as.data.frame(cv_est_matr)

# Renames the first column of the data frame
colnames(cv_est_frame)[1] <- "CV_error"

# Renames the second column of the data frame
colnames(cv_est_frame)[2] <- "Num_folds"

# Creates 2X3 data frame with summary statistics grouped on eval vector value
cv_est_frame_1 <- cv_est_frame %>% 
                  group_by(Num_folds) %>% 
                  summarise(cv_mean = mean(CV_error), 
                            cv_sd = sd(CV_error))

# Calls data frame
cv_est_frame_1

# Splits based on eval vector value
cv_est_frame$Num_folds <- factor(cv_est_frame$Num_folds)

# Makes plot from cv_est_frame data, groups on the x axis, cv error on y axis
ggplot(data = cv_est_frame, mapping = aes(x = Num_folds, y = CV_error)) + 

  # creates three box plots
  geom_boxplot() +

  # add labels
  labs(title = "CV Error by number of folds",
       x = "Number of folds",
       y = "CV error") +

  # Centers and bolds title
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))



#read_csv() read data
#source() read code
#save/saveRDS
#ggsave

It appears as the number of folds we have increases, then the grouped mean and standard deviations for the MSE errors across all our folds decreases. We should expect this, because we are dividing our MSE error over more folds, making the CV error lower as we increase the number of folds. This may indicate that the best model for prediction may have a high value of k.

Our boxplots show a trends that as the number of folds increases, then the CV error decreases, and similarly, the IQR, range, and overall variance of the boxes decreases substantially as well as k increases.



danter2000/NewPackageLecture9 documentation built on Dec. 19, 2021, 8:10 p.m.