\addtolength{\headheight}{-.025\textheight} \thispagestyle{fancyplain} \rhead{\includegraphics[height=.1\textheight]{logo.png}} \renewcommand{\headrulewidth}{0pt}
knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
I thank Bryan Martin for having tought me everything I know about R.
This package is my project 3 for UW's STAT302. And it maily contains 4 functions:
my_t_test
my_lm
my_knn_cv
my_fr_cv
Moreover we can find the less imprtant functions my_f_to_c
and my_pow
and the dataset my_iris
, which we won't discuss in this vignette. The package also contains my_gapminder
, the dataset we will use for the tutorials.
Here, we introduce myfirstpackage, a package developed in a statitistical softwer class. It contains some of the most basic and essential functions for statistical inference and predictions.
Note that in order to follow along with this tutorial you will need to have installed ggplot2.
Install myfirstpackage using:
devtools::install_github("MatteVin/myfirstpackage")
Then we need to load the following packages
library(myfirstpackage) library(ggplot2) library(dplyr)
The following totorial gives some practical demonstrations of how to use the functions and interpret the results.
my_t_test
can be used to do a one smaple Student's t-test. For this demonstration we use the variable lifeExp
from the my_gapminder
dataset.
We set the null hypotesis to be such that:
\begin{align}
H_0: \mu &= 60,\
\end{align}
Then we can inizialize and set
mu = 60
We use a p-value cut-off of $\alpha = 0.05$, which is the most commonly used. Then we run the first alternative hypothesys $H_a: \mu\neq60$
my_t_test(my_gapminder$lifeExp, alternative = "two.sided", mu)
We see that the resulting p-value is greater than $\alpha$ and thus we confirm the null hypotesis and reject the alternative one.
Then we run the thirs alternative hypothesys $H_a: \mu < 60$
my_t_test(my_gapminder$lifeExp, alternative = "less", mu)
We see that the resulting p-value is less than 0.05 and thus we confirm the null hypotesis and reject the alternative one. Then we run the first alternative hypothesys $H_a: \mu>60$
my_t_test(my_gapminder$lifeExp, alternative = "greater", mu)
We see that the resulting p-value is greater than 0.05 and thus we confirm the null hypotesis and reject the alternative one.
We will demonstrate how to use my_lm
to fit a linear model to some data. To do so we will use my_gapminder
, the variable lifeExp
will be the rsponse variable, while gdpPercap
and continent
will be used as explanatory variables
test <- my_lm(lifeExp ~ gdpPercap + continent, data = my_gapminder) test
We can oberve that the gdpPercap coefficient appears to be positive but small, only r test[2,1]
. What we need to keep in mind when we fit any linear model is that the magnetude of a coefficient can be missleading. We can see how gdpPercap
is in dollars, while lifeExp
is in years. What this model shows is that for each thousand of dollars increase in gdp per capita we can observe an average increase of almost half a year.
We first need to set a $H_0$ and a $H_a$.
$$H_0: coef = 0$$
$$H_a: coef \neq 0$$
Then, we will test it using the significant level of 0.05.
$$ \alpha = 0.05$$
We can see that the p-value, obtained from my_lm
, less than the $\alpha$, so we can reject the null hypothesis and accept the alternative one.
Now we plot actial Vs fitted values.
my_coef <- test[, 1] my_matrix <- model.matrix(lifeExp ~ gdpPercap + continent, data = my_gapminder) y_hat <- my_matrix %*% as.matrix(my_coef) my_data <- data.frame("Actual" = my_gapminder$lifeExp, "Fitted" = y_hat, "Continent" = my_gapminder$continent) ggplot(my_data, aes(x = Actual, y = Fitted, color = Continent)) + geom_point() + geom_abline(slope = 1, intercept = 0, col = "black") + labs(title = "Actual vs. Fitted", x = "Actual values", y = "Fitted values") + theme_bw(base_size = 15) + theme(plot.title = element_text(hjust = 0.5, face = "bold",size = 15))
From the graph we can observe that the fit is good for Oceania amd most of Europe, while it does farely bad in Africa, America and Asia. We can thus notice that the poorer on average a continent is the worst the fit was, which means that our lineal model works mostly with high gdpPercap values. The reason is that the relationship between lifeExp
and gdpPercap
is not linear. Indeed when running regressions using gdpPercap we often take it's log and use it in place, which I believe in this case would have provided a better fit.
We will now show houw to use my_knn_cv
to predict output class continent using covariates gdpPercap
and lifeExp
from my_gapminder
.
We will use 5 folds cross validation
k_cv <- 5
Then we run the function changing the parameter for the number of neigboors, iterating from 1 to 10. We will record the training misclassification rate and the CV misclassification rate (which is outputed by my_knn_cv
).
error_train <- rep(NA, 10) error_cv <- rep(NA, 10) for(i in 1:10){ output <- my_knn_cv(train = my_gapminder[,5:6], cl = my_gapminder$continent, k_nn = i, k_cv = k_cv ) error_train[i] <- sum(output$class != my_gapminder$continent) / length(output$class) error_cv[i] <- output$cv_err } result <- data.frame("k_nn" = c(1:10), "training_error" = error_train, "cv_err" = error_cv) result
Looking at the table, the CV missclassification error is decreasing as the number of
neighbors increases. It appears to be farely constant, with only a change r max(result[,3]) - min(result[,3])
,
thus we would chose k_nn = 10.
On the other hand training error increases as k_nn increases, thus we would choose knn = 1.
The reson why the two measures have opposite trands is that when one inceases the other decreases.
Think of the CV misclassification rate as a measure of how generalizable to other data the model is.
While the training error is a measure of how accurately the model predicts the data it has been trained on.
The more neigboors you take into consideration the more your model will be generalizable (low cross validation error), but then it will get less precise, and training error will increase.
Overall I believe that the model is quite weak, scoring always more well above 50% cross validation error. In this case, since the cross validation error is quite high regardless on the number of neighboors, I would use a small k_nn value to try keeping training error down, accepting to have the CV error high. I would probably choose k_nn = 2.
We will use my_rf_cv
to predict lifeExp using covariate gdpPercap.
We iterate through k in c(2, 5, 10):For each value of k, we run the function 30 times and record the MSE for each run.
.
mse_cv <- matrix(nrow = 90, ncol = 2) for(i in 1:30){ mse_cv[i, 2] <- my_rf_cv(2) mse_cv[i + 30, 2] <- my_rf_cv(5) mse_cv[i + 60, 2] <- my_rf_cv(10) mse_cv[i, 1] <- "k = 2" mse_cv[i + 30, 1] <- "k = 5" mse_cv[i + 60, 1] <- "k = 10" } mse_cv_data <- as.data.frame(mse_cv) colnames(mse_cv_data) <- c("k", "cv_mse") ggplot(mse_cv_data, aes(x = k, y = cv_mse, group = k, fill = factor(k))) + geom_boxplot() + labs(title = "MSE of k folds", x = "Number of Folds", y = "MSE", fill = "Number of Folds") + theme_bw(base_size = 15)
mse_2 <-as.numeric(mse_cv_data[1:30, 2]) mse_5 <-as.numeric(mse_cv_data[31:60, 2]) mse_10 <-as.numeric(mse_cv_data[61:90, 2]) result <- data.frame("k_nn" = c("k=2", "k=5", "k=10"), "mean_mse" = c(mean(mse_2), mean(mse_5), mean(mse_10 ) ), "sd_mse" = c(sd(mse_2), sd(mse_5), sd(mse_10) ) ) result
We can see the cross validation error decrese as the number of folds increases, in both the table and the graph.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.