Introduction

The goal of stat302Package is to demonstrate my understanding of building R packages. This package includes the following functions:

  - my_t.test
  - my_lm
  - my_knn_cv
  - my_rf_cv

Installation

To get started with stat302Package, you can install the development version directly from GitHub.

# install.packages("devtools")
devtools::install_github("laurenng/stat302Package")
library(stat302Package)

Tutorial for my_t.test

First, we need to specify the data we are working with. For all examples in this tutorial, we will be using the life expectancy column (lifeExp) from my_gapminder.

data <- stat302Package::my_gapminder
lifeExpectancy <- data$lifeExp

Next, we will start analyzing the my_t.test function. This function takes in 3 parameters:

  1. x - a numeric vector of data

  2. alternative - a character string specifying the alternative hypothesis. The acceptable values are "two.sided", "less", and "greater".

  3. mu - a number indicating the null hypothesis value of the mean.

The function then returns a list with the following elements: the numeric test statistic, the degrees of freedom, the value of the alternative parameter, and the p-value.

For this tutorial, we'll be looking specifically at the different alternative tests and the significance of the t-test result via the p-value.

Two-sided T-Test

$$ \begin{align} H_0: \mu &= 60 \\ H_a: \mu &\neq 60 \end{align} $$

# performing a t-test with a two-sided alternative 
two_sided_t_test <- stat302Package::my_t.test(lifeExpectancy, 
                              alternative = "two.sided",
                              mu = 60)

# printing out the p-value
print(two_sided_t_test$p_val)

From performing a two-sided t-test on the life expectancy data, we find that the p-value is 0.09. This value is greater than the $\alpha$ = 0.05 cutoff. Therefore, we fail to reject the null hypothesis.
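
As a quick sanity check, we can compare this against base R's t.test (this comparison is illustrative; note that stats::t.test names its p-value element p.value rather than p_val):

# cross-checking the two-sided p-value against base R's t.test
stats::t.test(lifeExpectancy, alternative = "two.sided", mu = 60)$p.value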

Less Than T-Test

$$ \begin{align} H_0: \mu &= 60 \\ H_a: \mu &< 60 \end{align} $$

# performing a t-test with a less than alternative 
less_than_t_test <- stat302Package::my_t.test(lifeExpectancy, 
                                              alternative = "less", 
                                              mu = 60)

# printing out the p-value
print(less_than_t_test$p_val)

From performing a t-test with the alternative hypothesis that mu is less than 60, we find that the p-value is 0.95. This value is greater than the $\alpha$ = 0.05 cutoff. Therefore, we fail to reject the null hypothesis.

Greater T-Test

$$ \begin{align} H_0: \mu &= 60 \\ H_a: \mu &> 60 \end{align} $$

# performing a t-test with a greater than alternative 
greater_than_t_test <- stat302Package::my_t.test(lifeExpectancy, 
                                 alternative = "greater",
                                 mu = 60)

# printing out the p-value
print(greater_than_t_test$p_val)

From performing a t-test with the alternative hypothesis that mu is greater than 60, we find a p-value below the $\alpha$ = 0.05 cutoff. Therefore, we can reject the null hypothesis in favor of the alternative that the true mean is greater than 60.
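
Note that the two one-sided p-values should be complementary: for a continuous test statistic, P(T ≤ t) + P(T ≥ t) = 1, which is why the less-than test reported 0.95. We can check this (assuming my_t.test computes the tail probabilities in the standard way):

# the one-sided p-values should sum to 1 for a continuous statistic
less_than_t_test$p_val + greater_than_t_test$p_val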

Tutorial for my_lm

The my_lm function performs linear model inference and prediction. It takes in a formula describing the model and a data frame, and returns a coefficient table for the fitted model.

# performing a linear model using the package's my_lm function 
lm_model <- stat302Package::my_lm(lifeExp ~ gdpPercap + continent,
                  data = stat302Package::my_gapminder)

# let's look specifically at the gdpPercap variable
gdp_data <- lm_model["gdpPercap", ]
print(gdp_data)

Looking specifically at the gdpPercap variable, the fitted model tells us that, holding continent constant, every one-unit increase in GDP per capita is associated with a 4.45 increase in expected life expectancy.

Hypothesis test

$$ \begin{align} H_0: \beta_{\text{gdpPercap}} &= 0 \\ H_a: \beta_{\text{gdpPercap}} &\neq 0 \end{align} $$ The hypothesis test checks whether we can reject the null hypothesis that the estimated coefficient (beta) on gdpPercap is zero.

P-value interpretation

print(gdp_data["Pr(>|t|)"])

The p-value for this coefficient is nearly 0 and less than the $\alpha$ = 0.05 cutoff. Therefore, we can reject the null hypothesis and conclude that beta is not equal to 0.
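
Under the hood, this p-value comes from a t-statistic equal to the estimate divided by its standard error. As a quick sketch (assuming my_lm's coefficient table includes a "Std. Error" column, as the table from summary(lm()) does):

# recomputing the t-statistic by hand as Estimate / Std. Error
# (the "Std. Error" column name is an assumption borrowed from summary(lm()))
t_stat <- gdp_data[["Estimate"]] / gdp_data[["Std. Error"]]
t_stat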

Actual vs. Fitted Scatterplot

library(ggplot2)

# finding the fitted values (yhat)
# getting the matrix of estimated coefficients 
my_estimates <- as.matrix(lm_model[,"Estimate"])

# fitting data into matrix to create x matrix
x_mat <- model.matrix(lifeExp ~ gdpPercap + continent,
                      stat302Package::my_gapminder)

# matrix multiplication to get yhat 
yhat <- x_mat %*% my_estimates

# declaring actual values 
gapminder_data <- stat302Package::my_gapminder
lifeExpected <- gapminder_data$lifeExp 

# Creating dataframe with actual and fitted values 
my_df <- data.frame(actual = lifeExpected, fitted = yhat)

# plotting the fitted vs actual values 
fitted_actual_scatter <- ggplot(my_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) + 
  theme_bw(base_size = 15) +
  labs(x = "Fitted values", 
       y = "Actual values", 
       title = "Actual vs. Fitted",
       caption = "This shows off the performance of the model ") +
  theme(plot.title = element_text(hjust = 0.5)) 

fitted_actual_scatter

Each dot represents a row in the dataset. The dashed red reference line is where we want all the dots to lie, since a point on it has a fitted value equal to its actual value.

In our diagram, we can see that the lower the fitted values, the more variance there is around the reference line. As the fitted values increase, they become more and more similar to the actual values and follow the reference line.
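
One simple way to quantify this pattern is to summarize the residuals, i.e. the vertical distances between each point and the reference line:

# residuals: vertical distances from the dashed reference line
my_df$residual <- my_df$actual - my_df$fitted
summary(my_df$residual)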

Tutorial for my_knn_cv

library(dplyr)
library(tidyr)

# Creating the datasets used in performing the function, my_knn_cv
data <- stat302Package::my_penguins %>%
        drop_na()
train_data <- data %>%
             dplyr::select(bill_length_mm, bill_depth_mm,
                    flipper_length_mm, body_mass_g)
target_class <- data$species

# initializing the vectors 
miscalc_rates <- c()
cv_rates <- c()

# iterating through 1-10 neighbors in the function my_knn_cv
for (i in 1:10){
  # performing my_knn_cv with the datasets and i number of neighbors
  my_knn <- stat302Package::my_knn_cv(train_data, target_class, i, 5)
  cv_rates <- append(cv_rates, my_knn$cv_error)

  # calculating the training misclassification rate 
  knn_miscalc <- class::knn(train = train_data, test = train_data,
                            cl = target_class, k = i)
  miscalc_true <- mean(knn_miscalc != target_class)
  miscalc_rates <- append(miscalc_rates, miscalc_true)
}

# formatting the dataframe with the correct headers 
total_set <- data.frame(miscalc_rates, cv_rates)
colnames(total_set) <- c("Training Misclassification Rate", 
                         "CV Misclassification Rate")

# viewing the total set 
print(total_set)

Based on the tests performed for 1 to 10 neighbors, I would choose the model with 1 neighbor, as it has the lowest cross-validation misclassification rate. This means the model should perform most accurately on data it hasn't seen before.

In practice, I would pick the model with the lowest cross-validation misclassification rate, as it tells us how much error there is across the multiple folds and tests performed to build the model. This also informs us how the model is likely to perform outside of the sample data, so a low misclassification rate is ideal; see the example below.
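
For example, we can pull the best neighbor count straight from the CV rates gathered above:

# position of the lowest CV misclassification rate;
# positions 1 through 10 correspond to k = 1 through 10 neighbors
which.min(cv_rates)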

What is cross-validation?

Cross-validation is used to evaluate a model: it tells us approximately how much error the model makes across the splits (folds) of the data. This informs us how the model is likely to perform on data outside of the sample.
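
As an illustration, here is a minimal hand-rolled sketch of 5-fold cross-validation for a 1-nearest-neighbor model (illustrative only; this is not necessarily how my_knn_cv implements it internally):

# a minimal 5-fold CV sketch: split the data into 5 folds, hold each
# fold out in turn, train on the rest, and average the error rates
set.seed(302)
folds <- sample(rep(1:5, length.out = nrow(train_data)))
fold_errors <- numeric(5)
for (f in 1:5) {
  pred <- class::knn(train = train_data[folds != f, ],
                     test = train_data[folds == f, ],
                     cl = target_class[folds != f], k = 1)
  fold_errors[f] <- mean(pred != target_class[folds == f])
}
# average misclassification rate across the 5 held-out folds
mean(fold_errors)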

Tutorial for my_rf_cv

# initializing dataframe for cross validation error 
all_cv <- data.frame("Name"=character(), 
                 "Value"=double(), 
                 stringsAsFactors=FALSE) 

# performing 30 tests for each fold listed here: 2, 5, 10 
# to create a dataframe utilized in the boxplot below 
for (i in c(2, 5, 10)) {
  for (t in 1:30) {
    all_cv <- rbind(all_cv, data.frame(Name = i,
                                       Value = stat302Package::my_rf_cv(i)))
  }
}

# formatting dataframe for the boxplot
colnames(all_cv) <- c("Name", "Value")
all_cv$Name <- as.factor(all_cv$Name)

# creating the boxplot
cv_boxplot <- ggplot2::ggplot(all_cv, ggplot2::aes(x=Name, y=Value)) + 
  ggplot2::geom_boxplot() + 
  ggplot2::labs(title = "CV error by number of folds",
       x = "# of folds", 
       y = "Cross-validation error",
       caption = "This boxplot illustrates the variation in CV error
       across 30 runs for each number of folds") + 
   ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5))

# displaying the boxplot
cv_boxplot

When analyzing the diagram, we see that the fewer folds there are, the greater the variance in CV error. We can also see that the CV errors for 2 folds are overall greater than for the other fold counts. Interestingly, the single run with the lowest CV error used 5 folds, but that doesn't mean 5 folds is the most reliable choice.
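
We can confirm which single run produced that minimum directly from the data frame:

# the single run with the lowest CV error, along with its fold count
all_cv[which.min(all_cv$Value), ]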

Table of Mean and Standard Deviation

# finding the mean and standard deviation of each fold group
mean_sd_table <- all_cv %>%
  dplyr::group_by(Name) %>%
  dplyr::summarise(mean = mean(Value), sd = sd(Value))

# printing table out 
mean_sd_table

When looking at the table, we can confirm that the observations above are true: both the mean and the standard deviation of the CV error decrease as the number of folds increases. I think this is because more folds means more splits are averaged within the cross-validation algorithm, so outliers are canceled out.

With only two folds, there is less opportunity for outlier data to be balanced out, which is why the standard deviation drops sharply once we move past two folds.

Setup

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(stat302Package)

