In hugoliao330/stat302util: Utilities for STAT302 at UW

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introductory

The package \code{stat302util} contains the functions that are taught and built in the STAT302 course. The functions contained make inferences and/or predictions based on the input data. Namely, this package has the ability to perform a t-test, to fit linear models, to find the k-nearest neighbor cross-validatory classification, and to make random forest cross-validation prediction.

This package is built following Bryan D Martin's instruction from \link[https://bryandmartin.github.io/STAT302/docs/LectureSlides/lectureslides9/lectureslides9.html#1].

Install \texttt{stat302util} using:

devtools::install_github("hugoliao330/stat302util", build_vignette = TRUE, build_opts = c())

To begin, we load our package from library:

library(stat302util)

Tutorial for my_t.test

This function gives statistics of the t-score, degree of freedom, and the p-value.

To start, we will use the \code{lifeExp} data from \code{my_gapminder}.

mgm <- my_gapminder
le <- mgm$lifeExp

To test the hypothesis on the \code{lifeExp} data where \begin{align} H_0: \mu &= 60,\ H_a: \mu &\neq 60. \end{align}

my_t.test(x = le, alternative = "two.sided", mu = 60)

With an alpha threshold of 0.05, this means that we should not rejected the null hypothesis as the p-value exceeds the threshold, and the null hypothesis holds.

To test the hypothesis where \begin{align} H_0: \mu &= 60,\ H_a: \mu &< 60. \end{align}

my_t.test(x = le, alternative = "less", mu = 60)

With $\alpha = 0.05$, we reject the null hypothesis, meaning that the true mean of life expectancy of the population should be less than 60 years old.

To test the hypothesis where \begin{align} H_0: \mu &= 60,\ H_a: \mu &> 60. \end{align}

my_t.test(x = le, alternative = "greater", mu = 60)

In this case, we definitely cannot reject the null hypothesis, as this is the exact opposite of the last hypothesis we tested.

For more information, you can see the documentation of this function by typing this in your console:

?my_t.test

Tutorial for my_lm

This function fits the data with a linear regression model on response variable and explanatory variables.

I will demonstrate this function with data from \code{my_gapminder}, where \code{lifeExp} would be my response variable and \code{gdpPercap} and \code{continent} as explanatory variables.

ml <- my_lm(formula = lifeExp ~ gdpPercap + continent, mgm)
ml

The coefficient for \code{gdpPercap} is r ml$Estimate[2], which is fairly low compared to the coefficients for \code{continent}. This number indicates the correlation between \code{gdpPercap} and the response variable \code{lifeExp}.

We can propose a hypothesis test associated with the \code{gdpPercap} coefficient, where the null hypothesis is that the coefficient of co-variate \code{gdpPercap} is zero, meaning there is no correlation at all, whereas the alterantive hypothesis would be that the coefficient is not zero, meaning there is a correlation.

With the information given by the result of \code{my_lm}, and given a p-value cut-off of $\alpha = 0.05$, we can safely conclude that the null hypothesis is rejected and that the alternative hypothesis holds because of the low p-value of r ml$Pr...t..[2].

We can use a ggplot 2 to visualize the actual verses fitted values. First, let's load the \code{ggplot2} package

library(ggplot2)

Tutorial for my_knn_cv

Again, I will demonstrate this function using data from \code{my_gapminder}.

We can predict output class \code{continent} using covariates \code{gdpPercap} and \code{lifeExp} with a 5-fold cross validation. I will also iterate from 1 to 10 for the kth-nearest-neighbor for this function, in an effort to compare the training misclassification rate as well as the CV misclassification rate.

train <- data.frame(mgm$gdpPercap, mgm$lifeExp)
true_cl <- mgm$continent
train_mc <- rep(NA, length = 10)
cv_mc <- rep(NA, length = 10)

for (i in 1:10) {
  result <- my_knn_cv(train = train, cl = true_cl, k_nn = i, k_cv = 5)
  train_mc[i] <- result$cv_err
  cv_mc[i] <- mean(true_cl != result$class)
}

# show results
train_mc
cv_mc

Based on training misclassification rate I would choose around \code{k_nn = 8}. Based on the CV misclassification rate I would choose \code{k_nn = 1}. However, choosing the optimal \code{k_nn} value based on true misclassificaiton rates in such a small sample is not reasonable. In practice, we should choose based on the training misclassification rate because it better represents how my model will react to data excluded by the training data. Cross-validation is able to acheive this because it randomly splits the training data in \code{k_cv} folds. Everytime, it uses all but one fold of data to train the model and then to fit it to the last fold of data. This increases flexibility of the model and thus accuracy in unknowns.

Tutorial for my_rf_cv

As usual, I will use the data from \code{my_gapminder} to demonstrate the functionalities.

Let's predict the \code{lifeExp} using covariate \code{gdpPercap}.

mse <- matrix(NA, nrow = 30, ncol = 3)
k <- c(2, 5, 10)
for (j in 1:3) {
  for (i in 1:30) {
    mse[i, j] <- my_rf_cv(mgm["gdpPercap"], mgm["lifeExp"], k[j])
  }
}
colnames(mse) <- c("2", "5", "10")

Using a ggplot to display the result from last chunk.

dat <- stack(as.data.frame(mse))
ggplot(dat) +
  geom_boxplot(aes(x = ind, y = values)) +
  labs(title = "MSE with respect to k", x = "k", y = "value")

We can also take a look at the average and the standard deviation of the data.

df <- data.frame("mean" = c(mean(mse[,1]), mean(mse[,2]), mean(mse[,3])), 
           "standard deviation" = c(sd(mse[,1]), sd(mse[,2]), sd(mse[,3])))
rownames(df) <- c("k=2", "k=5", "k=10")
df

It is pretty clear from both the box plot as well as the table that the standard deviation is the smallest with a \code{k} value of 10. It surely seems that as \code{k} increases, the narrower the distribution becomes. This means that with a higher \code{k}, \code{my_rf_cv} is able to predict more accurately with less error, which is intuitively reasonable. Folding the training dataset in more partitions allows the model to be more precise with more training data to look at, despite the expense of computing power.

hugoliao330/stat302util documentation built on March 18, 2020, 12:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com