knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This package is written for statistical inference and prediction, and includes
the functions my_t_test
, my_lm
, my_knn_cv
, and my_rf_cv
.
To install and load this package:
# install firstpackage devtools::install_github("hchang23/firstpackage")
# load firstpackage library(firstpackage)
The t-tests below will be using the lifeExp
variable from my_gapminder
and
interpreting the results with $\alpha$ = 0.05.
# load my_gapminder data data("my_gapminder")
$H_0: \mu = 60$
$H_A: \mu \neq 60$
my_t_test(my_gapminder$lifeExp, "two.sided", 60)
The p_val
from the results is 0.093, which is greater than $\alpha$ = 0.05,
so I fail to reject the null hypothesis. There is not enough evidence to suggest
that the mean life expectancy is not 60 years.
$H_0: \mu = 60$
$H_A: \mu < 60$
my_t_test(my_gapminder$lifeExp, "less", 60)
The p_val
from the results is 0.0466, which is less than $\alpha$ = 0.05,
so I reject the null hypothesis. There is sufficient evidence to suggest that
the mean life expectancy is less than 60 years.
$H_0: \mu = 60$
$H_A: \mu > 60$
my_t_test(my_gapminder$lifeExp, "greater", 60)
The p_val
from the results is 0.95, which is greater than $\alpha$ = 0.05,
so I fail to reject the null hypothesis. There is not enough evidence to suggest
that the mean life expectancy is greater than 60 years.
The linear regression model below will be using lifeExp
as the response
variable and gdpPercap
and continent
as the explanatory variables. The
hypothesis test associated with the coefficient gdpPercap
, which is the GDP
per capita in USD, is:
$H_0: \beta = 0$
$H_A: \beta \neq 0$
test <- my_lm(lifeExp ~ gdpPercap + continent, my_gapminder) test
From the results, the p-value for gdpPercap
coefficient is r test[2, 4]
,
which is less than $\alpha$ = 0.05, so I reject the null hypothesis. There is
strong evidence that there is a relationship between lifeExp
and gdpPercap
.
# make data frame of fitted and actual values my_coeff <- test[, 1] my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) y_hat <- my_matrix %*% as.matrix(my_coeff) my_df <- data.frame("Fitted" = y_hat, "Actual" = my_gapminder$lifeExp, "Continent" = my_gapminder$continent) # use ggplot2 to plot the fitted vs actual values library(ggplot2) ggplot(my_df, aes(x = Actual, y = Fitted, color = Continent)) + geom_point() + geom_abline(slope = 1, intercept = 0) + theme_light()
From the plot, the line is the model fit and the data points are the actual values. The line appears to fit the data points from Europe and Oceania fairly well, but doesn't predict as well for the other continents.
For the k-nearest neighbor cross-validation tutorial, we will be predicting
output class continent
using covariates gdpPercap
and lifeExp
. We will use
a 5-fold cross-validation (k_cv
= 5) and 1- to 10-nearest neighbors
(k_nn
= 1,...,10). In the my_knn_cv
function, the data is split into
k_cv
folds, and all but one fold is used for training data and fitting
the model and the last fold for test data and making prediction, repeated k_cv
times with different folds each time.
# empty vectors cv_error <- vector() train_err <- vector() # run function 10 times with k_nn = 1:10 for (i in 1:10) { knn <- my_knn_cv(my_gapminder[, c(4, 6)], my_gapminder$continent, i, 5) # predict table pred <- table(my_gapminder$continent, knn$class) # CV error cv_error[i] <- knn$CV_error # training error train_err[i] <- 1 - sum(diag(pred) / (sum(rowSums(pred)))) } # table of error values err_table <- cbind(cv_error, train_err) err_table
Looking at just the cv_error
, which is the average cross-validation
misclassification error rate, I would choose to use the model with 10-nearest
neighbors since it has the lowest CV error. Looking at just the train_err
,
I would choose to use the model with 1-nearest neighbor because there is no
training error. In practice, I would choose the model with 5-nearest neighbor to
balance between the CV and training errors.
The my_rf_cv
function uses the random forest algorithm with 100 number of
trees to predict lifeExp
using covariate gdpPercap
from the my_gapminder
data. We will use k
= 2, 5, 10, ran 30 times for each k
, for this tutorial.
# empty data frame cv_mse <- data.frame() # vector of k values k_val <- c(2, 5, 10) # 3 k values for (i in 1:3) { # 30 iterations of function for (j in 1:30) { cv_mse[30 * (i - 1) + j, 1] <- k_val[i] # elements of column are the ith k cv_mse[30 * (i - 1) + j, 2] <- my_rf_cv(k_val[i]) # function for ith k } } colnames(cv_mse) <- c("k", "MSE") cv_mse$k <- factor(cv_mse$k, levels = c("2", "5", "10"))
# create side-by-side boxplot of the cross-validation MSE for each k ggplot(cv_mse, aes(k, MSE, fill = k)) + geom_boxplot() + labs(title = "CV Estimated MSE for Each k-Fold", x = "\nk-Folds", y = "Estimated MSE\n") + theme_light() + theme(plot.title = element_text(hjust = 0.5, size = 12), legend.position = "none")
# empty vectors avg_cv <- vector() sd_cv <- vector() # calculate mean and standard deviation of CV estimates for each k for (i in 1:3) { avg_cv[i] <- mean(cv_mse[cv_mse$k == k_val[i], 2]) sd_cv[i] <- sd(cv_mse[cv_mse$k == k_val[i], 2]) } # create table of CV means and sds cv_table <- cbind("k-Fold" = c(2, 5, 10), "Avg_CV" = avg_cv, "CV_SD" = sd_cv) cv_table
Looking at both the boxplot and the table, the median CV estimates are not too
different, but the variance increases as the number of folds increases. For
k
= 2, the standard deviation is r cv_table[1, 3]
while for k
= 10, the
standard deviation is r cv_table[3, 3]
. This happens because there is a
bias-variance tradeoff when doing k-fold cross validation. Lower values of k have
higher bias but lower variance, and as k approaches n, the number of training
data, the estimates have lower bias but variance increases.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.