```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
# Packages used throughout the vignette
library(project3package)
library(ggplot2)
library(kableExtra)
library(dplyr)
```
This package contains 4 statistical functions for inference and prediction:

`my_t.test`
: performs a t-test

`my_lm`
: fits a linear model to data

`my_knn_cv`
: performs k-nearest-neighbors cross-validation for statistical prediction

`my_rf_cv`
: performs random forest cross-validation for statistical prediction

Below are instructions for getting started:
You can install the package from GitHub:
```{r, eval = FALSE}
devtools::install_github("BenjaminLowry/project3package", build_vignettes = TRUE, build_opts = c())
```
Then load the package via:

```r
library(project3package)
```
Below are three examples of t-tests that can be performed with `my_t.test` using the `lifeExp` (life expectancy) data from `my_gapminder`, which is included in the package:
Test #1: $H_0 : \mu = 60, H_a : \mu \ne 60$
```{r}
test_1 <- my_t.test(my_gapminder$lifeExp, "two.sided", 60)
test_1
```
With $\alpha = 0.05$, this t-statistic is not significant, since the p-value of `r test_1$p_val` is greater than $\alpha$, and thus we do not reject the null hypothesis.
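For reference, assuming `my_t.test` implements the standard one-sample t-test, the statistic reported above is

$$
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
$$

where $\bar{x}$ and $s$ are the sample mean and standard deviation of `lifeExp`, $n$ is the sample size, and $\mu_0 = 60$ is the hypothesized mean; it is compared against a $t$ distribution with $n - 1$ degrees of freedom.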
Test #2: $H_0 : \mu = 60, H_a : \mu < 60$
```{r}
test_2 <- my_t.test(my_gapminder$lifeExp, "less", 60)
test_2
```
With $\alpha = 0.05$, this t-statistic is significant, since the p-value of `r test_2$p_val` is less than $\alpha$, and thus we reject the null hypothesis.
Test #3: $H_0 : \mu = 60, H_a : \mu > 60$
```{r}
test_3 <- my_t.test(my_gapminder$lifeExp, "greater", 60)
test_3
```
With $\alpha = 0.05$, this t-statistic is not significant, since the p-value of `r test_3$p_val` is greater than $\alpha$, and thus we do not reject the null hypothesis.
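All three tests share the same statistic and differ only in how the p-value is computed. Assuming `my_t.test` follows the usual conventions for the alternative-hypothesis argument, the p-values are

$$
p_{\text{two.sided}} = 2\,P\!\left(T_{n-1} \ge |t|\right), \qquad
p_{\text{less}} = P\!\left(T_{n-1} \le t\right), \qquad
p_{\text{greater}} = P\!\left(T_{n-1} \ge t\right),
$$

where $T_{n-1}$ has a $t$ distribution with $n - 1$ degrees of freedom. This is why the one-sided "less" alternative can reject at $\alpha = 0.05$ while the two-sided and "greater" alternatives do not.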
Below is an example of using `my_lm` to perform a regression where `my_gapminder$lifeExp` is the response variable and `my_gapminder$gdpPercap` and `my_gapminder$continent` (together with their interaction) are the explanatory variables.
```{r}
my_fit <- my_lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
my_fit
```
"gdpPercap" has a coefficient of r my_fit$Estimate[2]
as shown in the table above. This means that if all other variables stayed the same (and thus we were ignoring the prescence or effect of the interaction variables in the table above) then for every increase in unit of GDP per capita in a country would correlate with an increase of r my_fit$Estimate[2]
years of life expectancy.
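For context, the formula `lifeExp ~ gdpPercap * continent` expands to main effects plus interactions, so (assuming the default treatment coding, with the first continent level as the baseline) the fitted model has the form

$$
\widehat{\text{lifeExp}} = \hat{\beta}_0 + \hat{\beta}_1\,\text{gdpPercap}
+ \sum_{c} \hat{\gamma}_c\,\mathbf{1}\{\text{continent} = c\}
+ \sum_{c} \hat{\delta}_c\,\text{gdpPercap}\cdot\mathbf{1}\{\text{continent} = c\},
$$

where $c$ ranges over the non-baseline continents. Under this coding, $\hat{\beta}_1$ is the `gdpPercap` slope for the baseline continent, which is why the interpretation above explicitly sets the interaction terms aside.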
The hypothesis test associated with this coefficient is $H_0: \beta_1 = 0$, $H_a: \beta_1 \ne 0$, where $\beta_1$ is the coefficient for `gdpPercap`. The p-value for the t-statistic given by the fit is `r my_fit[[4]][2]`, which is much less than $\alpha = 0.05$, and thus we reject the null hypothesis that $\beta_1 = 0$.
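This p-value comes from the usual coefficient t-test; assuming `my_lm` follows standard least-squares inference, the statistic is

$$
t = \frac{\hat{\beta}_1 - 0}{\widehat{\operatorname{se}}(\hat{\beta}_1)},
$$

which is compared against a $t$ distribution with $n - p$ degrees of freedom, where $n$ is the number of observations and $p$ is the number of estimated coefficients.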
```{r}
lm_fit <- lm(lifeExp ~ gdpPercap * continent, my_gapminder[c("lifeExp", "gdpPercap", "continent")])
mod_fits <- fitted(lm_fit)
my_df <- data.frame(actual = my_gapminder$lifeExp, fitted = mod_fits)
ggplot(my_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 12) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted") +
  theme(plot.title = element_text(hjust = 0.5))
```
This plot shows that the fit is not great at explaining the relationship between the variables. Especially at small values of $\hat{y}$, the spread of actual values around the fit is quite large. There are sections that match the fit quite well, such as fitted values from 70 to 80, but overall this graph suggests you would want to experiment with the model used to relate life expectancy, GDP per capita, and continent, since this fit is nowhere near a clean one.
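As one illustrative experiment (not part of the package), `gdpPercap` is strongly right-skewed, so a log transformation is a natural thing to try; the sketch below refits the same base-R model with `log(gdpPercap)` and redraws the actual-vs-fitted plot:

```{r, eval = FALSE}
# Illustrative alternative model: log-transform GDP per capita (not part of the package)
log_fit <- lm(lifeExp ~ log(gdpPercap) * continent, data = my_gapminder)
log_df <- data.frame(actual = my_gapminder$lifeExp, fitted = fitted(log_fit))
ggplot(log_df, aes(x = fitted, y = actual)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0, col = "red", lty = 2) +
  theme_bw(base_size = 12) +
  labs(x = "Fitted values", y = "Actual values", title = "Actual vs. Fitted (log gdpPercap)") +
  theme(plot.title = element_text(hjust = 0.5))
```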
Below is an example of using `my_knn_cv` to predict penguin species in `my_penguins` (also included in the package) from bill length, bill depth, flipper length, and body mass, comparing the 5-fold cross-validation error to the training error for `k_nn` from 1 to 10:

```{r}
covariates <- my_penguins[c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")]
class <- my_penguins[["species"]]
cv_errs <- vector(length = 10)
train_errs <- vector(length = 10)
for (i in 1:10) {
  cv_errs[i] <- my_knn_cv(covariates, class, i, 5)$cv_err
  # Count how many data points are misclassified using normal knn
  train_errs[i] <- length(which(class::knn(covariates, covariates, class, i) != class)) / nrow(my_penguins)
}
# Show table of errors
err_df <- cbind(1:10, data.frame(cv_errs), data.frame(train_errs))
colnames(err_df) <- c("k_nn", "cv_errs", "train_errs")
err_df
```
With the goal of minimizing error, I would choose the `k_nn = 1` model when considering both the CV error and the training error, since that is where the error is lowest in both cases. However, in practice we know that the training error will be zero at `k_nn = 1`, since the same data is used for both training and testing, so that model is likely overfitted to the given data and will not generalize well to new data. Choosing a model based on cross-validation is much better in this respect, since it was trained with 5 folds that keep the training and test data disjoint (every fold is the test set exactly once, with the remaining folds serving as training data for that iteration), which better reflects how the model will perform on data it has never seen before. The best CV model based on the table above is the one where `k_nn = 1` (since it has the lowest misclassification error), so this is the model I would choose.
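For intuition, here is a minimal sketch of how such a 5-fold split can be constructed (illustrative only; this is not necessarily how `my_knn_cv` implements it):

```{r, eval = FALSE}
# Minimal 5-fold split sketch: every observation gets exactly one fold label
k <- 5
n <- nrow(my_penguins)
fold <- sample(rep(1:k, length.out = n))
for (f in 1:k) {
  test_idx <- which(fold == f)    # fold f is the test set this iteration
  train_idx <- which(fold != f)   # the remaining folds form the training set
  # train on train_idx, predict on test_idx, and record the misclassification rate
}
```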
Finally, below is an example of using `my_rf_cv`, running the random forest cross-validation 30 times for each of k = 2, 5, and 10 folds and comparing the resulting MSEs:

```{r}
mse_matrix <- matrix(nrow = 90, ncol = 2)
ks <- c(2, 5, 10)
for (i in 1:3) {
  k <- ks[i]
  for (j in 1:30) {
    mse_matrix[(i - 1) * 30 + j, 1] <- k
    mse_matrix[(i - 1) * 30 + j, 2] <- my_rf_cv(k)
  }
}
mse_df <- data.frame(k = mse_matrix[, 1], mse = mse_matrix[, 2])
ggplot(data = mse_df, aes(group = k, x = k, y = mse)) +
  geom_boxplot() +
  labs(title = "MSE by Number of Folds", x = "Number of Folds", y = "MSE") +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))
# Get vectors of MSE by value of k
mse_k2 <- (mse_df %>% dplyr::filter(k == 2) %>% dplyr::select(mse))$mse
mse_k5 <- (mse_df %>% dplyr::filter(k == 5) %>% dplyr::select(mse))$mse
mse_k10 <- (mse_df %>% dplyr::filter(k == 10) %>% dplyr::select(mse))$mse
mse_stats_mat <- rbind(c(2, mean(mse_k2), sd(mse_k2)),
                       c(5, mean(mse_k5), sd(mse_k5)),
                       c(10, mean(mse_k10), sd(mse_k10)))
mse_stats_df <- data.frame(k = mse_stats_mat[, 1], mse_mean = mse_stats_mat[, 2], mse_sd = mse_stats_mat[, 3])
kable_styling(kable(mse_stats_df))
```
The box plot shows, first of all, that the mean MSE decreases as the number of folds increases, with the largest decrease happening between k = 2 and k = 5. It also shows the range and standard deviation of the MSEs decreasing significantly as the number of folds increases. This is substantiated by the table, which shows the mean dropping by `r round((1 - (mse_stats_df$mse_mean[2] / mse_stats_df$mse_mean[1])) * 100, digits = 1)`% from k = 2 to k = 5 and then by another `r round((1 - (mse_stats_df$mse_mean[3] / mse_stats_df$mse_mean[2])) * 100, digits = 1)`% from k = 5 to k = 10, while the standard deviation decreases by about `r round((1 - (mse_stats_df$mse_sd[2] / mse_stats_df$mse_sd[1])) * 100, digits = 1)`% from k = 2 to k = 5 and by another `r round((1 - (mse_stats_df$mse_sd[3] / mse_stats_df$mse_sd[2])) * 100, digits = 1)`% from k = 5 to k = 10. In relative terms, the standard deviation therefore decreases more than the mean does. I think the MSE decreases as the number of folds increases because with fewer folds the test sets are large relative to the training sets, so each iteration of the cross-validation has too little data to train on to predict its test points accurately, and this lower accuracy results in higher MSE. This improves as the number of folds increases, because the test set becomes small relative to the data the model is trained on. I think this is also related to why the standard deviation of the MSE decreases: with fewer folds, a split where the training and test data happen to be quite different (or quite similar) produces a relatively high (or low) MSE, so the particular split affects the MSE more than it does when there are more folds and the splitting is less impactful on the overall MSE.
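To make the training-set-size point concrete, with $n$ observations and $k$ folds each iteration trains on roughly $(k - 1)/k$ of the data, so the training fraction grows quickly with $k$ (a purely illustrative calculation, not package code):

```{r}
# Fraction of the data available for training in each CV iteration, by number of folds
folds <- c(2, 5, 10)
data.frame(k = folds, train_fraction = (folds - 1) / folds)
```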