In hchang23/firstpackage: Package Demo

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

This package is written for statistical inference and prediction, and includes the functions my_t_test, my_lm, my_knn_cv, and my_rf_cv.

To install and load this package:

# install firstpackage
devtools::install_github("hchang23/firstpackage")

# load firstpackage
library(firstpackage)

Tutorial: my_t_test

The t-tests below will be using the lifeExp variable from my_gapminder and interpreting the results with $\alpha$ = 0.05.

# load my_gapminder data
data("my_gapminder")

Test 1: alternative is "two.sided"

$H_0: \mu = 60$

$H_A: \mu \neq 60$

my_t_test(my_gapminder$lifeExp, "two.sided", 60)

The p_val from the results is 0.093, which is greater than $\alpha$ = 0.05, so I fail to reject the null hypothesis. There is not enough evidence to suggest that the mean life expectancy is not 60 years.

Test 2: alternative is "less"

$H_0: \mu = 60$

$H_A: \mu < 60$

my_t_test(my_gapminder$lifeExp, "less", 60)

The p_val from the results is 0.0466, which is less than $\alpha$ = 0.05, so I reject the null hypothesis. There is sufficient evidence to suggest that the mean life expectancy is less than 60 years.

Test 3: alternative is "greater"

$H_0: \mu = 60$

$H_A: \mu > 60$

my_t_test(my_gapminder$lifeExp, "greater", 60)

The p_val from the results is 0.95, which is greater than $\alpha$ = 0.05, so I fail to reject the null hypothesis. There is not enough evidence to suggest that the mean life expectancy is greater than 60 years.

Tutorial: my_lm

The linear regression model below will be using lifeExp as the response variable and gdpPercap and continent as the explanatory variables. The hypothesis test associated with the coefficient gdpPercap, which is the GDP per capita in USD, is:

$H_0: \beta = 0$

$H_A: \beta \neq 0$

test <- my_lm(lifeExp ~ gdpPercap + continent, my_gapminder)
test

From the results, the p-value for gdpPercap coefficient is r test[2, 4], which is less than $\alpha$ = 0.05, so I reject the null hypothesis. There is strong evidence that there is a relationship between lifeExp and gdpPercap.

# make data frame of fitted and actual values
my_coeff <- test[, 1]
my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder)
y_hat <- my_matrix %*% as.matrix(my_coeff)
my_df <- data.frame("Fitted" = y_hat, 
                    "Actual" = my_gapminder$lifeExp,
                    "Continent" = my_gapminder$continent)

# use ggplot2 to plot the fitted vs actual values
library(ggplot2)
ggplot(my_df, aes(x = Actual, y = Fitted, color = Continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  theme_light()

From the plot, the line is the model fit and the data points are the actual values. The line appears to fit the data points from Europe and Oceania fairly well, but doesn't predict as well for the other continents.

Tutorial: my_knn_cv

For the k-nearest neighbor cross-validation tutorial, we will be predicting output class continent using covariates gdpPercap and lifeExp. We will use a 5-fold cross-validation (k_cv = 5) and 1- to 10-nearest neighbors (k_nn = 1,...,10). In the my_knn_cv function, the data is split into k_cv folds, and all but one fold is used for training data and fitting the model and the last fold for test data and making prediction, repeated k_cv times with different folds each time.

# empty vectors
cv_error <- vector()
train_err <- vector()

# run function 10 times with k_nn = 1:10
for (i in 1:10) {
  knn <- my_knn_cv(my_gapminder[, c(4, 6)], my_gapminder$continent, i, 5)
  # predict table
  pred <- table(my_gapminder$continent, knn$class)
  # CV error
  cv_error[i] <- knn$CV_error
  # training error
  train_err[i] <- 1 - sum(diag(pred) / (sum(rowSums(pred))))
}

# table of error values
err_table <- cbind(cv_error, train_err)
err_table

Looking at just the cv_error, which is the average cross-validation misclassification error rate, I would choose to use the model with 10-nearest neighbors since it has the lowest CV error. Looking at just the train_err, I would choose to use the model with 1-nearest neighbor because there is no training error. In practice, I would choose the model with 5-nearest neighbor to balance between the CV and training errors.

Tutorial: my_rf_cv

The my_rf_cv function uses the random forest algorithm with 100 number of trees to predict lifeExp using covariate gdpPercap from the my_gapminder data. We will use k = 2, 5, 10, ran 30 times for each k, for this tutorial.

# empty data frame
cv_mse <- data.frame()
# vector of k values
k_val <- c(2, 5, 10)

# 3 k values
for (i in 1:3) {
  # 30 iterations of function
  for (j in 1:30) {
    cv_mse[30 * (i - 1) + j, 1] <- k_val[i] # elements of column are the ith k 
    cv_mse[30 * (i - 1) + j, 2] <- my_rf_cv(k_val[i]) # function for ith k
  }
}

colnames(cv_mse) <- c("k", "MSE")
cv_mse$k <- factor(cv_mse$k, levels = c("2", "5", "10"))

# create side-by-side boxplot of the cross-validation MSE for each k
ggplot(cv_mse, aes(k, MSE, fill = k)) +
  geom_boxplot() +
  labs(title = "CV Estimated MSE for Each k-Fold",
       x = "\nk-Folds", y = "Estimated MSE\n") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5, size = 12),
        legend.position = "none")

# empty vectors
avg_cv <- vector()
sd_cv <- vector()

# calculate mean and standard deviation of CV estimates for each k
for (i in 1:3) {
  avg_cv[i] <- mean(cv_mse[cv_mse$k == k_val[i], 2])
  sd_cv[i] <- sd(cv_mse[cv_mse$k == k_val[i], 2])
}
# create table of CV means and sds
cv_table <- cbind("k-Fold" = c(2, 5, 10), "Avg_CV" = avg_cv, 
                  "CV_SD" = sd_cv)
cv_table

Looking at both the boxplot and the table, the median CV estimates are not too different, but the variance increases as the number of folds increases. For k = 2, the standard deviation is r cv_table[1, 3] while for k = 10, the standard deviation is r cv_table[3, 3]. This happens because there is a bias-variance tradeoff when doing k-fold cross validation. Lower values of k have higher bias but lower variance, and as k approaches n, the number of training data, the estimates have lower bias but variance increases.

hchang23/firstpackage documentation built on March 18, 2020, 7:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com