knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

library(magrittr)
library(dplyr)
library(project3package)
library(kableExtra)
library(ggplot2)
library(tidyr)
devtools::install_github("anasmuhd/FirstPackage", build_vignettes = TRUE, build_opts = c())
library(FirstPackage)

Tutorials

This vignette walks through the four functions provided by this package: my_t.test, my_lm, my_knn_cv, and my_rf_cv.

my_t.test

$$H_0: \mu = 60$$ $$H_a: \mu \neq 60$$

$$\alpha = 0.05$$

Test 1, alternative = less

# my_gapminder <- gapminder::gapminder
# my_t.test(my_gapminder$lifeExp, "less", 60)

Explanation: Running the t-test with alternative = "less" yields a p-value smaller than 0.05. We therefore reject the null hypothesis that mu = 60 in favor of the alternative that mu < 60.

Test 2, alternative = greater

# my_t.test(my_gapminder$lifeExp, "greater", 60)

Explanation: Running the t-test with alternative = "greater" yields a p-value greater than 0.05. We therefore fail to reject the null hypothesis that mu = 60; there is no evidence that mu > 60.

Test 3, alternative = two.sided

# my_t.test(my_gapminder$lifeExp, "two.sided", 60)

Explanation: Running the t-test with alternative = "two.sided" yields a p-value greater than 0.05. We therefore fail to reject the null hypothesis that mu = 60.
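As a quick sanity check (not part of the original vignette), the same three hypotheses can be run through base R's stats::t.test, whose p-values my_t.test should match:

# # Sanity check: base R equivalents of the three calls above
# t.test(my_gapminder$lifeExp, alternative = "less", mu = 60)
# t.test(my_gapminder$lifeExp, alternative = "greater", mu = 60)
# t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)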

my_lm

The demonstration is as follows:

# my_model <- my_lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
# my_model

From the output, the estimated gdpPercap coefficient is 0.00137712, which means that a one-unit (one-dollar) increase in gdpPercap is associated with a 0.00137712-year increase in lifeExp for the baseline continent, holding the interaction terms fixed. Equivalently, each additional $1,000 of GDP per capita corresponds to roughly 1.4 additional years of expected life expectancy.
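To make that magnitude concrete, here is an illustrative back-of-the-envelope calculation (not output from the original vignette):

# # Illustrative: expected change in lifeExp for a $1,000 increase in
# # gdpPercap on the baseline continent, holding the interactions fixed
# 1000 * 0.00137712  # ~1.38 additional years of life expectancy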

# lm_fit <- lm(lifeExp ~ gdpPercap * continent,
#              my_gapminder[c("lifeExp", "gdpPercap", "continent")])
# mod_fits <- fitted(lm_fit)
# my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = mod_fits)
# ggplot(my_df, aes(x = fitted, y = actual)) +
#   geom_point() +
#   geom_abline(slope = 1, intercept = 0, col = "blue", lty = 1) +
#   theme_bw() +
#   labs(x = "Predicted values", y = "Actual values", title = "Actual vs. Predicted") +
#   theme(plot.title = element_text(hjust = 0.5))

Explanation: Each point in the plot compares a fitted (predicted) value from the linear model with the corresponding actual life expectancy, and the blue line is y = x, where perfect predictions would fall. Points that cluster tightly around this line indicate a good fit, while systematic departures from it show regions where the model over- or under-predicts; the degree of scatter around the line reflects how much variation in lifeExp the model leaves unexplained.
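For reference, the fitted values are simply the design matrix multiplied by the estimated coefficients. A minimal sketch, assuming lm_fit from the chunk above:

# # Fitted values by hand: y-hat = X %*% beta-hat
# X <- model.matrix(lm_fit)   # intercept, gdpPercap, continent dummies, interactions
# beta_hat <- coef(lm_fit)    # estimated coefficients
# head(drop(X %*% beta_hat))  # matches head(fitted(lm_fit))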

my_knn_cv

The demonstration is as follows:

# my_penguins <- my_penguins %>% drop_na()
# data_Frame <- my_penguins[c("bill_length_mm", "bill_depth_mm",
#                             "flipper_length_mm", "body_mass_g")]
# class <- my_penguins$species
# cv_errs <- vector(length = 10)
# train_errs <- vector(length = 10)
# for (i in 1:10) {
#   cv_errs[i] <- my_knn_cv(data_Frame, class, i, 5)$cv_err
#   # Training error: proportion of points misclassified when predicting
#   # the training data with itself using standard knn
#   pred <- class::knn(data_Frame, data_Frame, class, k = i)
#   train_errs[i] <- mean(pred != class)
# }
# # Print table of errors
# err_df <- cbind(1:10, data.frame(cv_errs), data.frame(train_errs))
# colnames(err_df) <- c("k_nn", "cv_errors", "train_errors")
# err_df

Explanation: Based on the training misclassification rates alone, we would choose k_nn = 1, which has a training error of 0.000. Based on the CV misclassification rates, we would choose any value of k_nn that yields the error of 0.8018018. In practice, we would choose the model selected by cross-validation, since its error is consistent and the gap between the CV errors and the training errors shrinks as k_nn increases; a smaller gap indicates better agreement between training and test performance. Cross-validation is a sample-splitting procedure that divides the data into small segments (folds) so that predictive performance can be estimated without a separate test set. The steps, sketched in code below, are:

1. Split the data into k parts (folds).
2. Use all but one fold as the training data and fit the model.
3. Use the remaining fold as the test data and make predictions.
4. Switch which fold is the test data and repeat steps 2 and 3 until every fold has been the test data (k times).
5. Compute the average error across the folds (the misclassification rate here, or the squared error for regression).
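The following is a minimal, illustrative sketch of that procedure; the helper cv_misclass is hypothetical and is not the implementation inside my_knn_cv:

# # Hypothetical k-fold CV skeleton for a knn classifier
# cv_misclass <- function(data, cl, k_nn, k_folds) {
#   # Randomly assign each row to one of k_folds folds
#   folds <- sample(rep(1:k_folds, length.out = nrow(data)))
#   errs <- numeric(k_folds)
#   for (f in 1:k_folds) {
#     train <- data[folds != f, ]
#     test  <- data[folds == f, ]
#     pred  <- class::knn(train, test, cl[folds != f], k = k_nn)
#     # Misclassification rate on the held-out fold
#     errs[f] <- mean(pred != cl[folds == f])
#   }
#   mean(errs)
# }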

my_rf_cv

# data(my_penguins)
# my_penguins <- my_penguins %>% drop_na()
# 
# mse_mat <- matrix(nrow = 90, ncol = 2)
# ks <- c(2, 5, 10)
# for (i in 1:3) {
#   k <- ks[i]
#   for (j in 1:30) {
#     mse_mat[(i-1)*30+j, 1] <- k
#     mse_mat[(i-1)*30+j, 2] <- my_rf_cv(k)
#   }
# }
# mse_df <- data.frame(k = mse_mat[, 1], mse = mse_mat[, 2])
# ggplot(data = mse_df, aes(group = k, x = k, y = mse)) + 
#   geom_boxplot(fill = "yellow") +
#   labs(title = "MSE based on Folds", x = "Number of Folds", y = "MSE") +
#   theme_bw(base_size = 15) + 
#   theme(plot.title =
#           element_text(hjust = 0.5),
#         text = element_text(size = 10))
# 
# 
# mse_k2 <- (mse_df %>% filter(k == 2) %>% select(mse))$mse
# mse_k5 <- (mse_df %>% filter(k == 5) %>% select(mse))$mse
# mse_k10 <- (mse_df %>% filter(k == 10) %>% select(mse))$mse
# mse_stat <- rbind(c(2, mean(mse_k2), sd(mse_k2)), 
#                        c(5, mean(mse_k5), sd(mse_k5)), 
#                        c(10, mean(mse_k10), sd(mse_k10)))
# mse_stats_df <- data.frame(k = mse_stat[,1],
#                            mse_mean = mse_stat[,2], 
#                            mse_sd = mse_stat[,3])
# 
# kable_styling(kable(mse_stats_df))

Explanation: From the boxplot, we can see that 2 folds records the highest mean MSE of the three settings. Its box is also the widest, which matches the table: 2 folds records the highest standard deviation at 6296.5, while 10 folds records the lowest at 2433.0. This is expected: with only 2 folds, each model is trained on just half of the data, so the fits vary more from split to split and are more sensitive to extreme observations, inflating both the mean and the variability of the MSE estimates. With 10 folds, each model is trained on 90% of the data, producing lower and more stable MSE estimates, which is reflected in the smaller standard deviation.
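To compare the spreads on a common scale, one could also look at the standard deviation relative to the mean (an illustrative addition, assuming mse_stats_df from the chunk above):

# # Illustrative: relative spread (sd / mean) for each number of folds
# mse_stats_df$mse_rel_sd <- mse_stats_df$mse_sd / mse_stats_df$mse_mean
# mse_stats_df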


