knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(STAT302package)
Welcome to Andrey Risukhin's Stat 302 Package! With this package, users can:
Run t-tests for the lesser, greater, and two-sided alternate hypothesis using my_t.test()
(note, this script is in the file labeled my_t_test.R
)
Fit a linear (polynomial) model to data using my_lm()
Use my_knn_cv
: a custom function to return knn predictions and evaluate its quality through cross validation.
Use my_rf_cv
: A custom random forest prediction function that also returns cross validation error.
To use this package, first install it with:
devtools::install_github("andreyrisukhin/STAT302package")
Then, run the following to use it in a given project:
library(STAT302package)
# Get the data gap_data <- STAT302package::my_gapminder#$lifeExp life_exp_data <- my_gapminder$lifeExp # Perform t-tests ah_not_60 <- my_t.test(life_exp_data, "two.sided", 60) ah_less_60 <- my_t.test(life_exp_data, "less", 60) ah_greater_60 <- my_t.test(life_exp_data, "greater", 60)
The test for the alternate hypothesis being not equal to 60 returns a P-value of r ah_not_60$p_val
.
The test for the alternate hypothesis being less than 60 returns a P-value of r ah_less_60$p_val
.
The test for the alternate hypothesis being greater than 60 returns a P-value of r ah_greater_60$p_val
.
Given an alpha of 0.05, we fail to reject the first and third null hypotheses, and fail to conclude that the mean life expectancy is not equal to 60 or is greater than 60. We are able to reject the second null hypothesis, concluding that the mean life expectancy is less than 60.
# Get the data gap_data <- STAT302package::my_gapminder#$lifeExp # Fit with linear regression output <- my_lm(lifeExp ~ gdpPercap + continent, data = gap_data) output
Given this output, we conclude the following:
The gdpPercap coefficient has the lowest value ("Estimate") and standard error of any coefficients. Therefore, gdpPercap is a feature which does not vary much, and in which a small change strongly affects the model prediction (hence the small coefficient).
The hypothesis test associated with the gdpPercap coefficient is a two-sided test, in which we reject the null hypothesis that this coefficient is zero because our P-value is much less than the threshold of 0.05.
Let us plot the actual and fitted values:
# To get Actual vs. Fitted, you need to multiply X %*% beta. To get beta, use your estimates. To get X, use `model.matrix(FORMULA, DATA)` beta <- c(output[[1]], output[[2]], output[[3]], output[[4]], output[[5]], output[[6]]) X <- model.matrix(lifeExp ~ gdpPercap + continent, data = gap_data) fitted <- X %*% beta actual <- gap_data$lifeExp act_vs_fit <- data.frame(actual, fitted, gap_data$continent) colnames(act_vs_fit) <- c("actual", "fitted", "continent")
library(dplyr) library(ggplot2) # Plot actual vs fitted, colored by continent plot_actual <- ggplot2::ggplot(data = act_vs_fit, aes(x = fitted, y = actual, color = continent)) + geom_point() + labs(title = "Actual vs Predicted Life Expectancy", x = "Life Expectancy From Fitted Model (years)", y = "True Life Expectancy (years)", caption = "Figure 1. Scatterplot of predicted and actual life expectancy, colored by continent.") + theme_bw(base_size = 16) + theme(plot.title = element_text(hjust = 0.5, face = "bold"), axis.text.x = element_text(angle = 0, hjust = 1, vjust = 1), plot.caption.position = "plot", plot.caption = element_text(hjust = 0)) plot_actual
The plot gives us insight into the differences between the ground truth for life expectancy, and the life expectancy predicted by the model. We see that for true life expectancy under 70, the model assigns a narrow and inaccurate life expectancy to each continent group. Once the true life expectancy supersedes roughly 70, the model becomes more accurate, increasing the predicted year with each increase in actual year (having a slope of 1). However, in these linear segments the model is only accurate for European and Oceanian life expectancy: it underpredicts by 5 to 10 years for life expectancy in the Americas, by 10 to 20 years in Asia, and never stops predicting 45 to 55 for African life expectancy. There are 8 Asian life expectancies the model seriously overpredicts, the most extreme by 45 years (true value of 60, predicted value of 105).
This suggests a linear model is not complex enough to capture the true relationships within the data.
This function used Cross-validation and k-nearest neighbors algorithms
# Remove NA in penguins dataset penguins_df <- na.omit(palmerpenguins::penguins) # Get the train and cl parameter from the data train <- penguins_df[, 3:6] cl <- penguins_df$species # Create a prediction list & error vector pred_class <- list() cv_err <- vector() # Create a for loop 1:10 for (j in 1:10) { temp <- my_knn_cv(train, cl, j, 5) pred_class[[j]] <- temp[[1]] cv_err[j] <- temp[[2]] } # Calculate the training error train_err <- vector() for (i in 1:10) { train_err[i] <- sum(pred_class[[i]] != penguins_df$species) / length(pred_class[[i]]) } # Create a table to hold the results compare_err <- cbind.data.frame(cv_err, train_err) compare_err
The function my_rf_cv can be used to predicts the lifeExp using k-fold Cross-validation and random forest algorithms.
For k in a range of 2, 5, 10 and iterate each case 30 times.
cv_error <- matrix(NA, nrow = 90, ncol = 2) row <- 1 for(i in 1:30) { cv_error[i, 1] <- 2 cv_error[i, 2] <- my_rf_cv(2) cv_error[i + 30, 1] <- 5 cv_error[i + 30, 2] <- my_rf_cv(5) cv_error[i + 60, 1] <- 10 cv_error[i + 60, 2] <- my_rf_cv(10) } cv_error
df <- data.frame("k" = cv_error[, 1], "mse" = cv_error[, 2]) ggplot(df, aes(x = k, y = mse, group = k, fill = factor(k))) + geom_boxplot() + labs(title = "MSE of k-folds", x = "Number of k", y = "MSE", fill = "# of Folds") + theme_bw(base_size = 12) + scale_x_continuous(breaks = c(2, 5, 10)) + theme(plot.title = element_text(hjust = 0.8), legend.title = element_text(hjust = 0.8, size = 12))
result_tbl <- df %>% group_by(k) %>% summarise(mean = mean(mse), sd = sd(mse)) %>% select(-k) %>% as.matrix() %>% as.table() rownames(result_tbl) <- c("k = 2", "k = 5", "k = 10") result_tbl
Based on the boxplot and table result above, As the k number increase,the mean and the median of mse increases while the the range of mse and the standard deviation decreases. The more folds we have, in conjunction with the increasing in iterations of the data, it definitely help the model perform better.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.