knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Special thanks to Bryan D. Martin for his teaching in STAT 302 and for kindly providing us with the example we use in this vignette.
This package is the final project for STAT 302 at the University of Washington, containing four functions we developed during the quarter:
my_t.test, my_lm, my_knn_cv, my_rf_cv
that allow us to analyze data with multiple methods. The package also includes a dataset:
my_gapminder
taken from the well-known gapminder dataset.
To install Project3Package, use the code below:
devtools::install_github("SylviaDu99/Project3Package")
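If you do not have devtools installed yet, you can install it from CRAN first:
install.packages("devtools")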
To begin, we need to load the following packages and sample dataset:
library(Project3Package)
library(magrittr)
library(ggplot2)
library(dplyr)
library(randomForest)
library(class)
library(stats)
data(my_gapminder)
If you are unfamiliar with my_gapminder, you can view a description of the data using:
?my_gapminder
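You can also take a quick look at the first few rows of the data:
head(my_gapminder)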
In this tutorial, we show how each of these functions works.
The function my_t.test can be used to perform a one-sample t-test, which tells us whether we can reject a null hypothesis about the mean.
For demonstration, we use lifeExp from my_gapminder.
We want to test the null hypothesis that the mean value of lifeExp is 60 $$H_0: \mu = 60$$ with significance level equal to 0.05. $$\alpha = 0.05$$
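For reference, the test statistic here is the standard one-sample t statistic (we assume my_t.test computes exactly this): $$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$ where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $\mu_0 = 60$ the hypothesized mean.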
The first example is a two-sided t-test: $$H_a: \mu \neq 60$$
my_t.test(my_gapminder$lifeExp, "two.sided", 60)
From the output list, the p-value is greater than the significance level 0.05, so we cannot reject the null hypothesis.
The second example is a lower-tailed t-test: $$H_a: \mu < 60$$
my_t.test(my_gapminder$lifeExp, "less", 60)
From the output list, the p-value is smaller than 0.05, so we can reject the null hypothesis.
The third example is an upper-tailed t-test: $$H_a: \mu > 60$$
my_t.test(my_gapminder$lifeExp, "greater", 60)
From the output list, the p-value is greater than 0.05, so we cannot reject the null hypothesis.
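As a sanity check, we can compare the two-sided result against base R; this assumes my_t.test mirrors the alternative and mu arguments of stats::t.test:
# base-R equivalent of the two-sided example above
t.test(my_gapminder$lifeExp, alternative = "two.sided", mu = 60)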
The function my_lm can be used to fit a linear model to the input data.
For demonstration, we use data from my_gapminder. Let lifeExp be the response variable, and gdpPercap and continent be the explanatory variables.
my_model <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder)
my_model
From the output table, we can see that the coefficient (the Estimate column) of gdpPercap is about 0.000445, which means that, holding continent fixed, lifeExp increases by about 0.000445 for every one-unit increase in gdpPercap; for example, an increase of 1,000 in gdpPercap corresponds to an increase of roughly 0.445 in predicted lifeExp.
To test whether this coefficient is significant, set $H_0$ and $H_a$ to:
$$H_0: \beta_{\text{gdpPercap}} = 0$$
$$H_a: \beta_{\text{gdpPercap}} \neq 0$$
and set the significance level to:
$$\alpha = 0.05$$
From the output table of my_lm, we can find the p-value (the Pr(>|t|) column).
Since the p-value is much smaller than 0.05, we can reject the null hypothesis.
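To see where this p-value comes from, here is a minimal sketch that recomputes it by hand. It assumes my_model is a coefficient table whose rows are named by term and whose first two columns are the Estimate and the Std. Error, as used in the plotting code below:
# t statistic for gdpPercap: Estimate / Std. Error
est <- my_model["gdpPercap", 1]
se <- my_model["gdpPercap", 2]
t_val <- est / se
# degrees of freedom: number of observations minus number of coefficients
df <- nrow(my_gapminder) - nrow(my_model)
# two-sided p-value
2 * pt(abs(t_val), df = df, lower.tail = FALSE)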
To visualize how well the model fits, we can plot the actual values against the fitted values:
formula <- lifeExp ~ gdpPercap + continent
f <- model.frame(formula, my_gapminder)
# extract the design matrix X
x <- model.matrix(formula, my_gapminder)
# extract the response variable y
y <- model.response(f) %>% as.matrix()
# fitted values: y_hat = X %*% beta
y_hat <- x %*% as.matrix(my_model[, 1])
my_df <- data.frame("actual" = my_gapminder$lifeExp,
                    "fitted" = y_hat,
                    "continent" = my_gapminder$continent)
ggplot(my_df, aes(x = fitted, y = actual, color = continent)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  labs(title = "Actual vs Fitted Values",
       x = "Fitted Values",
       y = "Actual Values",
       color = "Continent") +
  theme_bw(base_size = 15) +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_text(hjust = 0.5))
From the visualization, we can see that this model fits Europe and Oceania better than the other continents. To obtain a better-fitting model, we should also try other algorithms.
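To put a number on that visual impression, we can compute the mean absolute error of the fitted values within each continent, using the my_df built above (the mae column name is ours):
# mean absolute error by continent
my_df %>%
  group_by(continent) %>%
  summarise(mae = mean(abs(actual - fitted)))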
The function my_knn_cv predicts a class value with the k-nearest-neighbors algorithm, using k-fold cross-validation to estimate the misclassification error.
For demonstration, let continent be the class value we predict, using gdpPercap and lifeExp as covariates.
train <- my_gapminder %>% select(gdpPercap, lifeExp)
cl <- my_gapminder$continent
result <- matrix(NA, nrow = 10, ncol = 2)
for (i in 1:10) {
  output <- my_knn_cv(train, cl, i, 5)
  # CV misclassification rate
  result[i, 1] <- output$cv_err
  # training misclassification rate
  result[i, 2] <- output$te
}
result <- data.frame("number of neighbors" = c(1:10),
                     "CV misclassification rate" = result[, 1],
                     "training misclassification rate" = result[, 2])
result
k-fold cross-validation is an algorithm that randomly splits the data into k folds: k - 1 folds are used to train the model and the remaining fold is used to test it, and the process repeats so that each fold serves once as test data. It is useful because the resulting CV error estimates how well the model predicts data it has never seen, which lets us choose a model that generalizes well instead of one that merely fits the training data.
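As a rough illustration of this procedure, here is a minimal sketch of a 5-fold CV error computation using class::knn with one neighbor; this is not necessarily how my_knn_cv is implemented internally:
# assign each observation to one of 5 folds at random
k <- 5
folds <- sample(rep(1:k, length = nrow(train)))
errs <- numeric(k)
for (j in 1:k) {
  test_idx <- which(folds == j)
  # train on the other k - 1 folds, predict the held-out fold
  pred <- class::knn(train = train[-test_idx, ],
                     test = train[test_idx, ],
                     cl = cl[-test_idx],
                     k = 1)
  errs[j] <- mean(pred != cl[test_idx])
}
# average misclassification rate across the folds
mean(errs)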
For number of neighbors = 1, the training misclassification rate is always 0, while the CV misclassification rate is about 0.55, (almost) the highest among numbers of neighbors 1 through 10. For number of neighbors = 10, the training misclassification rate is about 0.4 and the CV misclassification rate is about 0.48; this case has the lowest CV misclassification rate and the highest training misclassification rate. Based on the training misclassification rates alone, we would choose the model with number of neighbors = 1; based on the CV misclassification rates, we would choose the model with number of neighbors = 10. In practice, the model with number of neighbors = 5 is a better choice, since both its training and CV misclassification rates are relatively low.
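We can also pick the best number of neighbors programmatically; indexing result by column position avoids relying on the exact (mangled) column names:
# row of the result table with the smallest CV misclassification rate
result[which.min(result[, 2]), ]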
The function my_rf_cv predicts lifeExp in the my_gapminder dataset from gdpPercap, using the random forest algorithm with k-fold cross-validation.
For demonstration, let k be 2, 5, and 10, and iterate each case 30 times.
# column 1: number of folds k; column 2: CV estimated MSE from my_rf_cv
cv_error <- matrix(NA, nrow = 90, ncol = 2)
for (i in 1:30) {
  cv_error[i, 1] <- 2
  cv_error[i, 2] <- my_rf_cv(2)
  cv_error[i + 30, 1] <- 5
  cv_error[i + 30, 2] <- my_rf_cv(5)
  cv_error[i + 60, 1] <- 10
  cv_error[i + 60, 2] <- my_rf_cv(10)
}
cv_error
my_df <- data.frame("k" = cv_error[, 1], "mse" = cv_error[, 2])
ggplot(my_df, aes(x = k, y = mse, group = k, fill = factor(k))) +
  geom_boxplot() +
  labs(title = "MSE of k folds",
       x = "Number of Folds (k)",
       y = "MSE",
       fill = "Number of Folds") +
  theme_bw(base_size = 15) +
  scale_x_continuous(breaks = c(2, 5, 10)) +
  theme(plot.title = element_text(hjust = 0.5),
        legend.title = element_text(hjust = 0.5, size = 10))
my_t <- my_df %>%
  group_by(k) %>%
  summarise(mean = mean(mse), sd = sd(mse)) %>%
  select(-k) %>%
  as.matrix() %>%
  as.table()
rownames(my_t) <- c("k = 2", "k = 5", "k = 10")
my_t
According to the boxplot and the table, as k increases, the range and the standard deviation of the MSE decrease, while the mean and the median of the MSE increase. A wider range of values goes together with a larger standard deviation, which explains why the 2-fold case has the largest sd and the 10-fold case the smallest. The more folds we create, the more of the data each training iteration uses, which helps generate a better model and more accurate predictions.
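To back up the claim about the spread, we can compute the range and standard deviation of the MSE for each k directly (the range column name is ours):
# spread of the CV MSE within each choice of k
my_df %>%
  group_by(k) %>%
  summarise(range = diff(range(mse)), sd = sd(mse))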