In jacklorrengriffin/myfirstpackage: Package For Stat 302

title: "Project 3: myfirstpackage Tutorial" author: "Jack L Griffin" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{myfirstpackage Tutorial} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(myfirstpackage)

This package is a data analysis package including a few functions that can help the user better understand and interpret their data. my_t.test performs a simple t-test, my_lm creates a linear model, my_knn_cv makes a prediction of classification based on data, and my_rf_cv predicts object variables based on covariates. The following code will install the package from github:

# install.packages("devtools")
# library(devtools)
# devtools::install_github("jacklorrengriffin/myfirstpackage")
library(myfirstpackage)
library(dplyr)

load gapminder data for tutorials

data(my_gapminder)

my_t.test tutorial

my_t.test(my_gapminder$lifeExp, "two.sided", 60)
my_t.test(my_gapminder$lifeExp, "less", 60)
my_t.test(my_gapminder$lifeExp, "greater", 60)

The above figures are the results for our t tests. At a p value cut off of 0.05, both our two sided test and our greater than test fail, with results greater than 0.05. The less than test returns a value of 0.047, which means that our results are significant at this cutoff. Therefore, we can conclude that the data rejects the null hypothesis and supports the alternative hypothesis, suggesting that the mean life expectancy is less than 60.

my_lm tutorial

my_lm(formula = lifeExp ~ gdpPercap + continent, data = my_gapminder)

library(ggplot2)
lm_output <- predict(lm(lifeExp ~ gdpPercap + continent, my_gapminder))
comparison <- data.frame("life_expectancy" = my_gapminder[["lifeExp"]], lm_output,
                         "color" = my_gapminder[["continent"]])

ggplot(data = comparison, aes(x = life_expectancy, y = lm_output, color = color)) + geom_point() + ylab("Fitted") + xlab("Actual")

Using my_lm, we demonstrate a regression using life expectancy as our response variable and gdp per capita and continent as our explanatory variables. Our gdp per capita coefficient is 4.45e-04, meaning that when gdp per capita increases by 1, life expectancy increases by about .0004 years.

Hypothesis test for gdp per capita and life expectancy: HO: GDP per capita has no significant correlation with life expectancy HA: The relationship between GDP per capita and life expectancy is statistically significant.

Using a level of 0.05, we can easily reject the null hypothesis for gdp per capita, as the p value is extremely small, suggesting that the data is statistically significant.

The figure we created showing the difference between the actual life expectancy and that predicted based on our model is also shown above. There is much less variability in the actual data than there is in our fitted model, likely because our model is overvaluing the only data we have given it, leading to too extreme of rpedictions.

my_knn_cv tutorial

data(my_penguins)
my_penguins <- na.omit(my_penguins)

train <- my_penguins[,3:6]
cl <- my_penguins %>% pull(species)

k1 <- my_knn_cv(train, cl, 1, 5)
k2 <- my_knn_cv(train, cl, 2, 5)
k3 <- my_knn_cv(train, cl, 3, 5)
k4 <- my_knn_cv(train, cl, 4, 5)
k5 <- my_knn_cv(train, cl, 5, 5)
k6 <- my_knn_cv(train, cl, 6, 5)
k7 <- my_knn_cv(train, cl, 7, 5)
k8 <- my_knn_cv(train, cl, 8, 5)
k9 <- my_knn_cv(train, cl, 9, 5)
k10 <- my_knn_cv(train, cl, 10, 5)
knn_results <- c(k1, k2, k3, k4, k5, k6, k7, k8, k9, k10)

print(knn_results)

ased on the training misclassification rates I would chose k=1, but based on the cv misclassification rates I would chose k=10. This makes sense because k=1 is likely overfit to the data, making it extremely accurate, but not as generalizable. As a result, I would choose knn = 10 in practice. This is a good demonstration of why cross fold validation is useful. By splitting the data into folds and using each one as training and the others as tests, it avoids overfitting to a specific data set and minimizes mean squared error.

my_rf_cv tutorial

data(my_penguins)
my_penguins <- na.omit(my_penguins)

cv_ests <- data.frame(rep(0, 90), rep(0, 90))
colnames(cv_ests) <- c("k", "cv")

for (i in 1:30) {
    cv_ests[i, 1] <- "2"
    cv_ests[i, 2] <- my_rf_cv(2)
}
for (i in 31:60) {
    cv_ests[i, 1] <- "5"
    cv_ests[i, 2] <- my_rf_cv(5)
}
for (i in 61:90) {
    cv_ests[i, 1] <- "10"
    cv_ests[i, 2] <- my_rf_cv(10)
}

ggplot(data = cv_ests, aes(x = k, y = cv, group = k)) +
  geom_boxplot() +
  labs(title = "k-Fold Cross-Validation Estimate") +
  theme_bw()

data_2 <- c(mean(cv_ests[which(cv_ests$k == "2"), ]$cv),
            sd(cv_ests[which(cv_ests$k == "2"), ]$cv))
data_5 <- c(mean(cv_ests[which(cv_ests$k == "5"), ]$cv),
            sd(cv_ests[which(cv_ests$k == "5"), ]$cv))
data_10 <- c(mean(cv_ests[which(cv_ests$k == "10"), ]$cv),
            sd(cv_ests[which(cv_ests$k == "10"), ]$cv))

table <- matrix(c(data_2, data_5, data_10), nrow = 3, ncol = 2, byrow = TRUE)
rownames(table) <- c("k = 2", "k = 5", "k = 10")
colnames(table) <- c("Mean", "SD")
as.table(table)

Our final tutorial is for my_rf_cv. We predict body mass using bill length, bill depth, and flipper length using k = 2, 5, and 10. The box plots clearly show that as the number of folds increases, the standard deviation decrasesand and mean both decrease. This is likely because the more folds we include, the more accurate the predicitons become and the less likely they are to overfit.

jacklorrengriffin/myfirstpackage documentation built on Dec. 20, 2021, 8:05 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com