```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Packages used throughout the vignette
library(project3package)
library(ggplot2)
library(kableExtra)
library(dplyr)
```
This package contains 4 statistical functions for inference and prediction:

`my_t.test`
: performs a t-test

`my_lm`
: fits a linear model to data

`my_knn_cv`
: performs k-nearest-neighbors cross-validation for statistical prediction

`my_rf_cv`
: performs random forest cross-validation for statistical prediction

Below are instructions for getting started:
You can install the package from GitHub:
```{r, eval = FALSE}
devtools::install_github("BenjaminLowry/project3package", build_vignettes = TRUE, build_opts = c())
```
Then load the package via:

```r
library(project3package)
```
Below are three examples of t-tests that can be performed with `my_t.test` using the `lifeExp` (life expectancy) data from `my_gapminder`, which is included in the package:
Test #1: $H_0 : \mu = 60, H_a : \mu \ne 60$
```{r}
test_1 <- my_t.test(my_gapminder$lifeExp, "two.sided", 60)
test_1
```
With $\alpha = 0.05$, this t-statistic is not significant, since the p-value of `r test_1$p_val` is greater than $\alpha$, and thus we do not reject the null hypothesis.
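For reference, assuming `my_t.test` implements the standard one-sample t-test, the statistic reported above is

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
$$

where $\bar{x}$ and $s$ are the sample mean and standard deviation of `lifeExp`, $n$ is the sample size, and $\mu_0 = 60$ is the hypothesized mean; it is compared against a $t$ distribution with $n - 1$ degrees of freedom.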
Test #2: $H_0 : \mu = 60, H_a : \mu < 60$
```{r}
test_2 <- my_t.test(my_gapminder$lifeExp, "less", 60)
test_2
```
With $\alpha = 0.05$, this t-statistic is significant, since the p-value of `r test_2$p_val` is less than $\alpha$, and thus we reject the null hypothesis.
Test #3: $H_0 : \mu = 60, H_a : \mu > 60$
```{r}
test_3 <- my_t.test(my_gapminder$lifeExp, "greater", 60)
test_3
```
With $\alpha = 0.05$, this t-statistic is not significant, since the p-value of `r test_3$p_val` is greater than $\alpha$, and thus we do not reject the null hypothesis.
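All three tests share the same statistic and differ only in how the p-value is computed. Assuming `my_t.test` follows the usual conventions for the alternative-hypothesis argument, the p-values are

$$
p_{\text{two.sided}} = 2\,P\!\left(T_{n-1} \ge |t|\right), \qquad
p_{\text{less}} = P\!\left(T_{n-1} \le t\right), \qquad
p_{\text{greater}} = P\!\left(T_{n-1} \ge t\right),
$$

where $T_{n-1}$ has a $t$ distribution with $n - 1$ degrees of freedom. This is why the one-sided "less" alternative can reject at $\alpha = 0.05$ while the two-sided and "greater" alternatives do not.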
Below is an example of using `my_lm` to perform a regression where `my_gapminder$lifeExp` is the response variable and `my_gapminder$gdpPercap` and `my_gapminder$continent` (together with their interaction) are the explanatory variables.
```{r}
my_fit <- my_lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
my_fit
```
"gdpPercap" has a coefficient of r my_fit$Estimate[2]
as shown in the table above. This means that if all other variables stayed the same (and thus we were ignoring the prescence or effect of the interaction variables in the table above) then for every increase in unit of GDP per capita in a country would correlate with an increase of r my_fit$Estimate[2]
years of life expectancy.
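For context, the formula `lifeExp ~ gdpPercap * continent` expands to main effects plus interactions, so (assuming the default treatment coding, with the first continent level as the baseline) the fitted model has the form

$$
\widehat{\text{lifeExp}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{gdpPercap}
+ \sum_{c} \hat{\gamma}_c\,\mathbf{1}\{\text{continent} = c\}
+ \sum_{c} \hat{\delta}_c\,\text{gdpPercap}\cdot\mathbf{1}\{\text{continent} = c\},
$$

where $c$ ranges over the non-baseline continents. Under this coding, $\hat{\beta}_1$ is the `gdpPercap` slope for the baseline continent, which is why the interpretation above explicitly sets the interaction terms aside.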
The hypothesis test associated with this coefficient is $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$, where $\beta_1$ is the coefficient for `gdpPercap`. The p-value for the t-statistic given by the fit is `r my_fit[[4]][2]`, which is much less than $\alpha = 0.05$, and thus we reject the null hypothesis that $\beta_1 = 0$.
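This p-value comes from the usual coefficient t-test; assuming `my_lm` follows standard least-squares inference, the statistic is

$$
t = \frac{\hat{\beta}_1 - 0}{\widehat{\operatorname{se}}(\hat{\beta}_1)},
$$

which is compared against a $t$ distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of estimated coefficients.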
```{r}
lm_fit <- lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
mod_fits <- fitted(lm_fit)
my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = mod_fits)
ggplot(my_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 12) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))
```
This plot shows that the fit is not great at explaining the relationship between the variables. Especially at small values of $\hat{y}$, the spread of actual values around the fit is quite large. There are sections that match the fit quite well, such as fitted values from 70 to 80, but overall this graph suggests you would want to experiment with the model used to relate life expectancy, GDP per capita, and continent, since this fit is nowhere near a clean one.
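As one illustrative experiment (not part of the package), `gdpPercap` is strongly right-skewed, so a log transformation is a natural thing to try; the sketch below refits the same base-R model with `log(gdpPercap)` and redraws the actual-vs-fitted plot:

```{r, eval = FALSE}
# Illustrative alternative model: log-transform GDP per capita (not part of the package)
log_fit <- lm(lifeExp ~ log(gdpPercap) * continent, data = my_gapminder)
log_df <- data.frame(actual = my_gapminder$lifeExp, fitted = fitted(log_fit))
ggplot(log_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 12) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted (log gdpPercap)") +
  theme(plot.title = element_text(hjust = 0.5))
```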
Below is an example of using `my_knn_cv` to predict penguin species in `my_penguins` (also included in the package) from bill length, bill depth, flipper length, and body mass, comparing the 5-fold cross-validation error to the training error for `k_nn` from 1 to 10:

```{r}
covariates <- my_penguins[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
class <- my_penguins[["species"]]
cv_errs <- vector(length = 10)
train_errs <- vector(length = 10)
for (i in 1:10) {
  cv_errs[i] <- my_knn_cv(covariates, class, i, 5)$cv_err
  # Count how many data points are misclassified using normal knn
  train_errs[i] <- length(which(class::knn(covariates, covariates, class, i) != class)) / nrow(my_penguins)
}
# Show table of errors
err_df <- cbind(1:10, data.frame(cv_errs), data.frame(train_errs))
colnames(err_df) <- c("k_nn", "cv_errs", "train_errs")
err_df
```
With the goal of minimizing error, I would choose the `k_nn = 1` model when considering both the CV error and the training error, since that is where the error is lowest in both cases. However, in practice we know that the training error will be zero at `k_nn = 1`, since the same data is used for both training and testing, so that model is likely overfitted to the given data and will not generalize well to new data. Choosing a model based on cross-validation is much better in this respect, since it was trained with 5 folds that keep the training and test data disjoint (every fold is the test set exactly once, with the remaining folds serving as training data for that iteration), which better reflects how the model will perform on data it has never seen before. The best CV model based on the table above is the one where `k_nn = 1` (since it has the lowest misclassification error), so this is the model I would choose.
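For intuition, here is a minimal sketch of how such a 5-fold split can be constructed (illustrative only; this is not necessarily how `my_knn_cv` implements it):

```{r, eval = FALSE}
# Minimal 5-fold split sketch: every observation gets exactly one fold label
k <- 5
n <- nrow(my_penguins)
fold <- sample(rep(1:k, length.out = n))
for (f in 1:k) {
  test_idx <- which(fold == f)    # fold f is the test set this iteration
  train_idx <- which(fold != f)   # the remaining folds form the training set
  # train on train_idx, predict on test_idx, and record the misclassification rate
}
```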
Finally, below is an example of using `my_rf_cv`, running the random forest cross-validation 30 times for each of k = 2, 5, and 10 folds and comparing the resulting MSEs:

```{r}
mse_matrix <- matrix(nrow = 90, ncol = 2)
ks <- c(2, 5, 10)
for (i in 1:3) {
  k <- ks[i]
  for (j in 1:30) {
    mse_matrix[(i - 1) * 30 + j, 1] <- k
    mse_matrix[(i - 1) * 30 + j, 2] <- my_rf_cv(k)
  }
}
mse_df <- data.frame(k = mse_matrix[, 1], mse = mse_matrix[, 2])
ggplot(data = mse_df, aes(group = k, x = k, y = mse)) +
  geom_boxplot() +
  labs(title = "MSE by Number of Folds", x = "Number of Folds", y = "MSE") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
# Get vectors of MSE by value of k
mse_k2 <- (mse_df %>% dplyr::filter(k == 2) %>% dplyr::select(mse))$mse
mse_k5 <- (mse_df %>% dplyr::filter(k == 5) %>% dplyr::select(mse))$mse
mse_k10 <- (mse_df %>% dplyr::filter(k == 10) %>% dplyr::select(mse))$mse
mse_stats_mat <- rbind(c(2, mean(mse_k2), sd(mse_k2)),
                       c(5, mean(mse_k5), sd(mse_k5)),
                       c(10, mean(mse_k10), sd(mse_k10)))
mse_stats_df <- data.frame(k = mse_stats_mat[, 1], mse_mean = mse_stats_mat[, 2], mse_sd = mse_stats_mat[, 3])
kable_styling(kable(mse_stats_df))
```
The box plot shows, first of all, that the mean MSE decreases as the number of folds increases, with the largest decrease happening between k = 2 and k = 5. It also shows the range and standard deviation of the MSEs decreasing significantly as the number of folds increases. This is substantiated by the table, which shows the mean dropping by `r round((1 - (mse_stats_df$mse_mean[2] / mse_stats_df$mse_mean[1])) * 100, digits = 1)`% from k = 2 to k = 5 and then by another `r round((1 - (mse_stats_df$mse_mean[3] / mse_stats_df$mse_mean[2])) * 100, digits = 1)`% from k = 5 to k = 10, while the standard deviation decreases by about `r round((1 - (mse_stats_df$mse_sd[2] / mse_stats_df$mse_sd[1])) * 100, digits = 1)`% from k = 2 to k = 5 and by another `r round((1 - (mse_stats_df$mse_sd[3] / mse_stats_df$mse_sd[2])) * 100, digits = 1)`% from k = 5 to k = 10. In relative terms, the standard deviation therefore decreases more than the mean does. I think the MSE decreases as the number of folds increases because with fewer folds the test sets are large relative to the training sets, so each iteration of the cross-validation has too little data to train on to predict its test points accurately, and this lower accuracy results in higher MSE. This improves as the number of folds increases, because the test set becomes small relative to the data the model is trained on. I think this is also related to why the standard deviation of the MSE decreases: with fewer folds, a split where the training and test data happen to be quite different (or quite similar) produces a relatively high (or low) MSE, so the particular split affects the MSE more than it does when there are more folds and the splitting is less impactful on the overall MSE.
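To make the training-set-size point concrete, with $n$ observations and $k$ folds each iteration trains on roughly $(k - 1)/k$ of the data, so the training fraction grows quickly with $k$ (a purely illustrative calculation, not package code):

```{r}
# Fraction of the data available for training in each CV iteration, by number of folds
folds <- c(2, 5, 10)
data.frame(k = folds, train_fraction = (folds - 1) / folds)
```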