In txqtiffany/STAT302package: Package Building Demonstration

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The goal of STAT302package is to demonstrate my learning on how to develop a well-documented, well- tested, and well-explained R package from the course of STAT 302.

Thank you to our instructor Bryan for teaching this course with great passion!

Installation

You can install the STAT302package using the following line:

``` {r install, eval = FALSE} devtools::install_github("txqtiffany/STAT302package", build_vignette = TRUE, build_opts = c())

## A tutorial for `my_t.test`

This is a basic example which shows you how to use `my_t.test`:

```r
library(STAT302package)
t.test_1 <- my_t.test(my_gapminder$lifeExp, "two.sided", 60)
t.test_1$p_val
t.test_2 <- my_t.test(my_gapminder$lifeExp, "less", 60)
t.test_2$p_val
t.test_3 <- my_t.test(my_gapminder$lifeExp, "greater", 60)
t.test_3$p_val

With the p-value cut-off of α = 0.05, we can conclude that for the first test, since r round(t.test_1$p_val, 3) > 0.05, we cannot reject the null hypothesis and there is insufficient evidence to support that the true mean of life expectency is not equal to 60.

For the second test, since r round(t.test_2$p_val, 3) < 0.05, we can reject the null hypothesis and there is sufficient evidence to support that the true mean of life expectency is less than 60.

For the third test, since r round(t.test_3$p_val, 3) > 0.05, we cannot reject the null hypothesis and there is insufficient evidence to support that the true mean of life expectency is greater than 60.

Therefore, at the 5% significance level, there is sufficient evidence to support the claim that the true mean of life expectency is less than 60.

A tutorial for `my_lm`

This is a basic example which shows you how to use my_lm:

library(STAT302package)
model <- my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)
print(round(model, 5))

From the statistics above, we know that the expected difference in the lifeExp for each unit difference in gdpPercap, holding all other covariates constant, is r round(model[2, 1], 5). Comparing to the coeffiecients of different continents, differences in gdpPercap have less influence on lifeExp than differences in continent, indicating that we might want to test its significance using a two-sided hypothesis test.

In other words, we are testing to see if the true mean of gdpPercap coefficient is 0. The null hypothesis is that it equals to 0, and the alternative hypothsis is that it doesn't equal to 0.

model[2, 4] < 0.05

Within the model, we can tell by the fourth column, which is our p values, that r round(model[2, 4], 5) is less our cutoff of α = 0.05.

Therefore, we reject the null coefficient that the mean of gdpPercap is 0, indicating that it is indeed significance. The reason behind why the number is so small might be due to its large scale of numbers that gdpPercap has comparing to lifeExp.

Now, we will use ggplot2 to plot the Actual vs. Fitted values.

my_estimates <- as.matrix(model[,"Estimate"])
x_mat <- model.matrix(lifeExp ~ gdpPercap + continent,
                      STAT302package::my_gapminder)
yhat <- x_mat %*% my_estimates
my_df <- data.frame("actual" = my_gapminder$lifeExp, "fitted" = yhat, "continent" = my_gapminder$continent)

ggplot2::ggplot(my_df, ggplot2::aes(x = fitted, y = actual, color = continent)) +
  ggplot2::geom_point() +
  ggplot2::geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  ggplot2::theme_bw() +
  ggplot2::labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  ggplot2::theme(plot.title = ggplot2::element_text(hjust = 0.5))

From the graph, we can see that our model is not a very good fit from the actual values with quite a high variance that deviates from the line. Especiallly for countries in Africa since the continent Africa was not incorporated as part of this model. This shows that the correlation between lifeExp and the gdpPercap, continent factors are not strong enough, and that there are other factors influencing life Expectency.

A tutorial for `my_knn_cv`

This is a basic example which shows you how to use my_knn_cv:

library(STAT302package)
library(kableExtra)
penguins_df <- na.omit(my_penguins)
train <- lapply(penguins_df[c("bill_length_mm", "bill_depth_mm",
                              "flipper_length_mm", "body_mass_g")], as.numeric)
cl <- as.numeric(penguins_df$species)
train_err <- c()
cv_err <- c()
for (i in 1:10) {
  knn_model <- my_knn_cv(train, cl, i, 5)
  cv_err[i] <- knn_model$cv_err
  train_err[i] <- sum(knn_model$class != cl) / length(my_penguins$species)
}
err_table <- data.frame("k_nn" = c(1:10), "train_err" = train_err, "cv_err" = cv_err)
kable_styling(kable(err_table))

Base on the training misclassification, besides the trivial set of k_nn = 1 where the error trivially equals to 0, I would choose k_nn = 2 with the lowest error.

Base on the CV misclassification rates, I would choose k_nn = 1 with the lowest error.

However, in practice, I would choose k_nn = 7 because that is an optimal point where both the training and misclassification errors dropped as K increase, and that the number of neighbors isn't too small for it to be overfitting.

Here, we're using cross-validation here which is a technique that is used to protect against overfitting in a predictive model. We do this by making a fixed number of folds (in this case, 5) of the data, run the analysis on each fold, and then average the overall error estimate.

A tutorial for `my_rf_cv`

This is a basic example which shows you how to use my_rf_cv:

library(STAT302package)
library(ggplot2)
library(reshape2)
library(kableExtra)
cv_mse <- matrix(NA, 30, 10)
for (i in c(2, 5, 10)) {
  for (j in 1:30) {
    cv_mse[j, i] <- my_rf_cv(i)
  }
}
cv_mse_df <- as.data.frame(cv_mse[ , c(2, 5, 10)])
colnames(cv_mse_df) <- c("k_2", "k_5", "k_10")
cv_mse_df_long <- melt(cv_mse_df)
ggplot(data = cv_mse_df_long,
                aes(x = variable, y = value)) +
  geom_boxplot(fill = "lightblue") +
  theme_bw() +
  labs(title = "CV estimated MSE by value of k",
       x = "Number of folds",
       y = "MSE") +
  theme(plot.title =
          element_text(hjust = 0.5))
mse_table <- data.frame("Mean" = c(mean(cv_mse_df$k_2),
                                   mean(cv_mse_df$k_5),
                                   mean(cv_mse_df$k_10)),
                        "Standard Deviation" = c(sd(cv_mse_df$k_2),
                                                 sd(cv_mse_df$k_5),
                                                 sd(cv_mse_df$k_10)))
row.names(mse_table) <- c("k = 2", "k = 5", "k = 10")
kable_styling(kable(mse_table))

Looking at the table, we can see that when k = 1, the mean as well as the standard deviation for CV estimated MSE is the highest, and the two numbers decreases as k increases. From the boxplot, we see that the range of MSE becomes more narrow and that the media decreases as k increases. However, thinking of the bias-variance tradeoff, as we increase the complexity of the model, meaning the higher number of folds, the bias decreases yet the variance should increase.

Although this table shows decreasing standard deviation/variance, my analysis would be due to the shape of the variance for cross-validation being more towards a u-shape than a line. With higher and higher folds, the variance should increases later because as k approaches n, the folds would become highly correlated.

txqtiffany/STAT302package documentation built on Dec. 23, 2021, 1:03 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com