The goal of stat302Package is to demonstrate my understanding of R packages. This package includes the following functions:

- my_t.test
- my_lm
- my_knn_cv
- my_rf_cv
To download the stat302 package, you can install the development version directly from GitHub.
# install.packages("devtools") devtools::install_github("laurenng/stat302Package") library(stat302Package)
First we need to specify the data we are working with. For all examples in this vignette, we will be using the lifeExp (life expectancy) column from my_gapminder.
```r
data <- stat302Package::my_gapminder
lifeExpectancy <- data$lifeExp
```
Next, we will analyze the my_t.test function. This function takes in 3 parameters:
- x: a numeric vector of data
- alternative: a character string specifying the alternative hypothesis; the acceptable values are "two.sided", "less", and "greater"
- mu: a number indicating the null hypothesis value of the mean
The function then returns a list with the following elements: the numeric test statistic, the degrees of freedom, the value of the alternative parameter, and the p-value.
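To make that return value concrete, here is a minimal sketch of the computation a one-sample t-test like this performs. This is illustrative only: sketch_t_test is a made-up name, and the packaged my_t.test implementation may differ in its details.

```r
# Minimal sketch of a one-sample t-test (illustrative only; sketch_t_test is
# a hypothetical name, not part of stat302Package)
sketch_t_test <- function(x, alternative, mu) {
  se <- sd(x) / sqrt(length(x))        # standard error of the sample mean
  test_stat <- (mean(x) - mu) / se     # t statistic
  df <- length(x) - 1                  # degrees of freedom
  p_val <- switch(alternative,
    "less"      = pt(test_stat, df),
    "greater"   = pt(test_stat, df, lower.tail = FALSE),
    "two.sided" = 2 * pt(abs(test_stat), df, lower.tail = FALSE),
    stop('alternative must be "two.sided", "less", or "greater"')
  )
  list(test_stat = test_stat, df = df, alternative = alternative, p_val = p_val)
}
```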
For this tutorial, we'll be looking specifically at the different alternative tests and the significance of the t-test result via the p-value.
$$
\begin{align}
H_0: \mu &= 60\\
H_a: \mu &\neq 60
\end{align}
$$
```r
# performing a t-test with a two-sided alternative
two_sided_t_test <- stat302Package::my_t.test(lifeExpectancy,
                                              alternative = "two.sided",
                                              mu = 60)
# printing out the p-value
print(two_sided_t_test$p_val)
```
From the two-sided t-test on the life expectancy of the gapminder data, we find that the p-value is 0.09. This value is greater than the $\alpha$ = 0.05 cut-off. Therefore, we fail to reject the null hypothesis.
$$
\begin{align}
H_0: \mu &= 60\\
H_a: \mu &< 60
\end{align}
$$
```r
# performing a t-test with a "less than" alternative
less_than_t_test <- stat302Package::my_t.test(lifeExpectancy,
                                              alternative = "less",
                                              mu = 60)
# printing out the p-value
print(less_than_t_test$p_val)
```
From the t-test with the alternative hypothesis that mu is less than 60, we find that the p-value is 0.95. This value is greater than the $\alpha$ = 0.05 cut-off. Therefore, we again fail to reject the null hypothesis.
$$
\begin{align}
H_0: \mu &= 60\\
H_a: \mu &> 60
\end{align}
$$
```r
# performing a t-test with a "greater than" alternative
greater_than_t_test <- stat302Package::my_t.test(lifeExpectancy,
                                                 alternative = "greater",
                                                 mu = 60)
# printing out the p-value
print(greater_than_t_test$p_val)
```
From the t-test with the alternative hypothesis that mu is greater than 60, we find that the p-value is approximately 0.045 (half the two-sided p-value). This value is less than the $\alpha$ = 0.05 cut-off. Therefore, we can reject the null hypothesis in favor of the alternative that $\mu > 60$.
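As a sanity check on the three results above (assuming the p_val elements returned by the calls above), the two one-sided p-values of the same test should sum to one, and the two-sided p-value should be twice the smaller one-sided p-value:

```r
# the "less" and "greater" p-values of the same test sum to (about) 1
print(less_than_t_test$p_val + greater_than_t_test$p_val)
# the two-sided p-value is twice the smaller one-sided p-value
print(2 * min(less_than_t_test$p_val, greater_than_t_test$p_val))
```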
The my_lm function performs linear model inference and prediction. It takes in a formula and a data frame, and returns a coefficient table whose rows are the model terms and whose columns include the Estimate and Pr(>|t|) values.
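For intuition about where those table entries come from, here is a hedged sketch of the standard least-squares quantities. sketch_lm is a made-up name for illustration; the actual my_lm source may differ.

```r
# Illustrative least-squares sketch (sketch_lm is hypothetical, not the
# packaged my_lm implementation)
sketch_lm <- function(formula, data) {
  X <- model.matrix(formula, data)                     # design matrix
  y <- model.response(model.frame(formula, data))      # response vector
  beta <- drop(solve(crossprod(X), crossprod(X, y)))   # coefficient estimates
  df <- nrow(X) - ncol(X)                              # residual degrees of freedom
  sigma2 <- sum((y - X %*% beta)^2) / df               # estimated error variance
  se <- sqrt(diag(sigma2 * solve(crossprod(X))))       # standard errors
  t_val <- beta / se                                   # t statistics
  p_val <- 2 * pt(abs(t_val), df, lower.tail = FALSE)  # two-sided p-values
  cbind(Estimate = beta, "Std. Error" = se, "t value" = t_val, "Pr(>|t|)" = p_val)
}
```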
```r
# performing a linear model using the package's my_lm function
lm_model <- stat302Package::my_lm(lifeExp ~ gdpPercap + continent,
                                  data = stat302Package::my_gapminder)
# let's look specifically at the gdpPercap variable
gdp_data <- lm_model["gdpPercap", ]
print(gdp_data)
```
Looking specifically at the gdpPercap variable, the fitted model tells us that, holding continent fixed, each one-unit increase in GDP per capita is associated with a 4.45 increase in life expectancy.
$$
\begin{align}
H_0: \beta &= 0\\
H_a: \beta &\neq 0
\end{align}
$$

This hypothesis test checks whether the estimated coefficient $\beta$ for gdpPercap is significantly different from zero, i.e., whether we can reject the null hypothesis or cannot conclude anything.
```r
print(gdp_data["Pr(>|t|)"])
```
The p-value for this coefficient is nearly 0 and less than the $\alpha$ = 0.05 cutoff. Therefore, we can reject the null hypothesis and conclude that $\beta$ is not equal to 0.
```r
library(ggplot2)

# FINDING fitted yhat
# getting the matrix of the estimated coefficients
my_estimates <- as.matrix(lm_model[, "Estimate"])
# building the design (X) matrix from the model formula
x_mat <- model.matrix(lifeExp ~ gdpPercap + continent, stat302Package::my_gapminder)
# matrix multiplication to get yhat
yhat <- x_mat %*% my_estimates

# declaring actual values
gapminder_data <- stat302Package::my_gapminder
lifeExpected <- gapminder_data$lifeExp

# creating a data frame with actual and fitted values
my_df <- data.frame(actual = lifeExpected, fitted = yhat)

# plotting the fitted vs. actual values
fitted_actual_scatter <- ggplot(my_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 15) +
  labs(x = "Fitted values", y = "Actual values",
       title = "Actual vs. Fitted",
       caption = "This shows the performance of the model") +
  theme(plot.title = element_text(hjust = 0.5))
fitted_actual_scatter
```
Each dot represents one row (observation) in the dataset. The dashed red reference line is where we want all the dots to lie, since points on that line have fitted values equal to the actual values.
In our diagram, we can see that at lower fitted values there is more spread around the line. As the fitted values increase, they become more and more similar to the actual values and follow the reference line more closely.
```r
library(dplyr)
library(tidyr)

# creating the datasets used in performing the function, my_knn_cv
data <- stat302Package::my_penguins %>% drop_na()
train_data <- data %>%
  dplyr::select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)
target_class <- data$species

# initializing the vectors
misclass_rates <- c()
cv_rates <- c()

# iterating through 1-10 neighbors in the function my_knn_cv
for (i in 1:10) {
  # performing my_knn_cv with the datasets, i neighbors, and 5 folds
  my_knn <- stat302Package::my_knn_cv(train_data, target_class, i, 5)
  cv_rates <- append(cv_rates, my_knn$cv_error)
  # calculating the training misclassification rate
  knn_pred <- class::knn(train = train_data, test = train_data,
                         cl = target_class, k = i)
  misclass_rates <- append(misclass_rates, mean(knn_pred != target_class))
}

# formatting the data frame with the correct headers
total_set <- data.frame(misclass_rates, cv_rates)
colnames(total_set) <- c("Training Misclassification Rate",
                         "CV Misclassification Rate")
# viewing the full table
print(total_set)
```
Based on the tests performed for 1 to 10 neighbors, I would choose the model in row 1, with 1 neighbor, as it has the lowest cross-validation misclassification rate. This means the model should perform most accurately on data it hasn't seen before.
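Programmatically, that choice is just the index of the smallest entry in the cv_rates vector built in the loop above:

```r
# index i with the lowest CV misclassification rate = chosen number of neighbors
best_k <- which.min(cv_rates)
print(best_k)
```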
In practice, I would pick the model with the lowest cross-validation misclassification rate, as it gives us insight into how much error there is across the multiple folds and tests performed to build the model. This also tells us how the model is likely to perform outside of the sample data, so a low misclassification rate is ideal.
Cross-validation is used to evaluate the model and tells us approximately how much error there is across the splits. This informs us how the model will perform outside of the sample data, on data it has not seen.
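For reference, here is a generic k-fold cross-validation sketch, shown with kNN to match my_knn_cv above; my_rf_cv is assumed to follow the same pattern with a random forest. sketch_knn_cv is a made-up name, so see the package source for the real implementation.

```r
# Generic k-fold CV sketch (hypothetical; not the packaged my_knn_cv)
sketch_knn_cv <- function(train, cl, k_nn, k_cv) {
  # randomly assign each observation to one of k_cv folds
  fold <- sample(rep(1:k_cv, length.out = nrow(train)))
  errs <- numeric(k_cv)
  for (j in 1:k_cv) {
    # train on all folds except j, then predict fold j
    pred <- class::knn(train = train[fold != j, ],
                       test  = train[fold == j, ],
                       cl    = cl[fold != j],
                       k     = k_nn)
    errs[j] <- mean(pred != cl[fold == j])  # fold-j misclassification rate
  }
  # the CV error is the average error across the k_cv held-out folds
  list(cv_error = mean(errs))
}
```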
```r
# initializing a data frame for the cross-validation errors
all_cv <- data.frame("Name" = character(), "Value" = double(),
                     stringsAsFactors = FALSE)

# performing 30 tests for each number of folds (2, 5, 10)
# to create the data frame used in the boxplot below
for (i in c(2, 5, 10)) {
  for (t in 1:30) {
    all_cv <- rbind(all_cv, c("Name" = i, "Value" = stat302Package::my_rf_cv(i)))
  }
}

# formatting the data frame for the boxplot
colnames(all_cv) <- c("Name", "Value")
all_cv$Name <- as.factor(all_cv$Name)

# creating the boxplot
cv_boxplot <- ggplot2::ggplot(all_cv, ggplot2::aes(x = Name, y = Value)) +
  ggplot2::geom_boxplot() +
  ggplot2::labs(title = "CV error by number of folds",
                x = "# of folds", y = "Cross-validation error",
                caption = "This boxplot shows the variation across 30 runs for each number of folds") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5))
# displaying the boxplot
cv_boxplot
```
When analyzing the diagram, we see that the fewer folds there are, the greater the variance in cross-validation error. We can also see that the cross-validation errors for 2 folds are, overall, greater than for the other fold counts. Interestingly, the single run with the lowest cross-validation error used 5 folds, but that doesn't mean 5 folds is the most reliable choice.
```r
# finding the mean and standard deviation of the CV error for each fold group
mean_sd_table <- all_cv %>%
  dplyr::group_by(Name) %>%
  dplyr::summarise(mean = mean(Value), sd = sd(Value))

# printing the table out
mean_sd_table
```
Looking at the table, we can confirm the observations above. We can clearly see a decrease in both the standard deviation and the mean as the number of folds increases. I think this is because, with more folds, each model is trained on a larger share of the data and the cross-validation error is averaged over more splits, so unusual (outlier) splits tend to cancel out.
With only two folds, there is less opportunity for outlier splits to be canceled and balanced out, which is why the standard deviation is highest at two folds and drops sharply as the number of folds increases.
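One heuristic way to see this (treating the per-fold errors as roughly independent, which is a simplifying assumption): the CV estimate is an average of $k$ per-fold errors, and averages over more terms have lower variance.

$$
\mathrm{CV}_{(k)} = \frac{1}{k}\sum_{j=1}^{k} \mathrm{err}_j,
\qquad
\mathrm{Var}\!\left(\mathrm{CV}_{(k)}\right) \approx \frac{\sigma^2_{\mathrm{err}}}{k}
$$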