knitr::opts_chunk$set(echo = TRUE)
We start by loading the packages we need for performing the analysis, here I've chosen to work the nycflights13, tidyverse and caret. nycflights13 data set, contain different 5 different data sets for air traffic from NYC during 2013 which is the data we will work with, tidyverse is package loading several different useful packages that are related to tidyr and dplyr and last but not least caret, which we use for building the prediction models.
Also, we set the seed, which tells us where the random iteration starts, so we can reproduce our results. If we wouldn't set seed, the script would generate different results everytime we run the script.
# Packages ---------------------------------------------------------------- library(nycflights13) library(caret) library(tidyverse) library(linreg) set.seed(1991) # Read data --------------------------------------------------------------- flights <- nycflights13::flights weather <- nycflights13::weather
We start off the analysis by merging our two different data sets into one, but before merging, we remove some time variables that are present in both files to avoid duplicates in the merged file.
# Removed duplicated variables time stamps from weather (variables exists in bith weather and flights data) weather <- weather %>% dplyr::select(-year, -month, -day, -hour) # Join flights and weather data df <- left_join(flights, weather, by = c("origin", "time_hour"))
When analysing the merged data frame we can see that there are several flights that are missing weather data. When we develop our prediction model, we want our data set to be tidy (one variable in each column, one observation in each row and no missing values). So to make sure there aren't just one airport or month that have all missing data we compute the weather_NA data frame.
# Flight missing weather information is distributed on all airports and months weather_NA <- df %>% filter(is.na(temp)) %>% group_by(origin, month) %>% summarise(missing = n()) %>% spread(month, missing) weather_NA
Since all airport have missing weather data and most months over the year (even if it more common in winter), we decide to remove all flights with missing weather data.
After a quick summary, we can see that there are a lot of variables that misses a few observations so we decide to remove these from the analysis as well (Output have been hidden in vignette).
lapply(df, FUN = function(x){summary(x)})
Finally we make a random subset of the data to reduce the volume, just for increasing calculation speed (this is something we only do for this exercise.)
# Delete all cases with missing value df <- df[complete.cases(df), ] %>% dplyr::select(-year) # Reducing the amount of data used for analysis df <- df %>% sample_frac(.01)
We need to decide what covariates we want to include in the model. To get a hunch of what variables that could have "predictive power" for flight delays, we make compute the correlations between all variables.
We construct one model with all variables which have an absolute correlation above 0.15.
We also make some additional data cleaning, removing all variabels we believe won't improve the model.
# Select variables with predictive value cor_matrix <- cor(df[sapply(df, is.numeric)]) has_effect <- cor_matrix[cor_matrix[,"arr_delay"] >= abs(0.15) ,"arr_delay"] formula_short <- arr_delay ~ dep_delay + dep_time + hour # Remove variables without effect df <- df %>% dplyr::select(-c(sched_dep_time, carrier, flight, tailnum, dest, minute, time_hour, wind_gust))
Now when the data is prepared, we can start building our prediction model.
First we partition the data set into 3 different groups, training, validation and testing.
Training data set is used to train the model for finding the optimal parameters. This data set contain 80% of the observations.
Validating data set is used to validate the result of our different models we develop. This data set contains 15 % of the observations.
Test data set is used for our final evaluation of our best prediction model. This data set contains 5% of the observations.
By splitting up the data set in different partitions, reduces the risk of overfitting the model, due to the fact that the data we validate haven't been used in the training process of our algorithm. The reason for splitting the testing and validating data is to make sure we don't tweak our model too much so it actually fit the validation set "better" than the real population of data. The test data set is only used once when you have developed a model you're satisfied with.
We also specify the resampling method. Here we uses 10-fold crossvalidation of the data repeated 3 times.
# Create test and training data ------------------------------------------- set.seed(1991) # Create train and test data in_train <- createDataPartition(y = df$arr_delay, p = .8, list = FALSE) training <- df[in_train, ] tmp <- df[-in_train, ] # create validate and test data in_validate <- createDataPartition(y = tmp$arr_delay, p = .75, list = FALSE) validate <- tmp[in_validate, ] testing <- tmp[-in_validate, ] rm(tmp) # Set resampling method --------------------------------------------------- ctrl <- trainControl(method = "repeatedcv", repeats = 3)
Ridgereg_model <- list( type = c("Regression"), library = "linreg", loop = NULL, prob = NULL) Ridgereg_model$parameters <- data.frame(parameter = "lambda", class = "numeric", label = "lambda") Ridgereg_model$grid <- function (x, y, len = NULL, search = "grid"){ data.frame(lambda = seq(0,2,by=0.25)) } Ridgereg_model$fit <- function (x, y, wts, param, lev, last, classProbs, ...){ dat <- if (is.data.frame(x)) x else as.data.frame(x) dat$.outcome <- y output <- ridgereg$new(.outcome ~ ., data = dat, lambda = param$lambda, ...) output } Ridgereg_model$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL){ if (!is.data.frame(newdata)) newdata <- as.data.frame(newdata) modelFit$predict(newdata) }
The first model we build contains all variables in our data set. Since ridge regression penalizes for a lot of covariates, we can include all variables without overfitting the model.
# Model 1 # Train model start <- Sys.time() lmr_fit1 <- train(arr_delay ~ ., data = training, method = Ridgereg_model, trControl = ctrl) end <- Sys.time() end-start # Full data 15 min to run, lambda = 0, Rsquared = 87.124% MAE 10.87018 # Predict lmr_class1 <- predict(lmr_fit1, newdata = validate) # Evaluate lmr_fit1 df_lmr_fit1 <- validate %>% dplyr::select(arr_delay) %>% cbind(., fitted_values = predict(lmr_fit1, newdata = validate)) %>% mutate(resid = arr_delay - fitted_values) gg1_lmr_fit1 <- ggplot(data = df_lmr_fit1, aes(x = arr_delay, y = fitted_values)) + geom_point(aes(alpha = .01)) + geom_abline(slope = 1, intercept = 0, colour = "red", size = 0.5) + ggtitle("Correct vs fitted values") + theme(legend.position="none") gg2_lmr_fit1 <- ggplot(data = df_lmr_fit1, aes(x = fitted_values, y = resid)) + geom_point(aes(alpha = .01)) + geom_smooth(method = "loess", se = FALSE, colour = "red", size = .5) + ggtitle("Residuals vs Fitted values") + theme(legend.position="none") gg1_lmr_fit1 gg2_lmr_fit1 postResample(lmr_class1, validate$arr_delay)
The optimal lambda for model 1 was 0.25.
Model 2 we've used all variables that have a correlation above absolute 0.15.
# Model 2 lmr_fit2 <- train(formula_short, data = training, method = Ridgereg_model, trControl = ctrl, preProc = "scale") # Predict lmr_class2 <- predict(lmr_fit2, newdata = validate) # Evaluted lmr_fit2 df_lmr_fit2 <- validate %>% dplyr::select(arr_delay) %>% cbind(., fitted_values = predict(lmr_fit2, newdata = validate)) %>% mutate(resid = arr_delay - fitted_values) gg1_lmr_fit2 <- ggplot(data = df_lmr_fit2, aes(x = arr_delay, y = fitted_values)) + geom_point(aes(alpha = .01)) + geom_abline(slope = 1, intercept = 0, colour = "red", size = 0.5) + theme(legend.position="none") gg2_lmr_fit2 <- ggplot(data = df_lmr_fit2, aes(x = fitted_values, y = resid)) + geom_point(aes(alpha = .01)) + geom_smooth(method = "loess", se = FALSE, colour = "red", size = .5) + theme(legend.position="none") gg1_lmr_fit2 gg2_lmr_fit2 postResample(lmr_class2, validate$arr_delay)
The optimal lambda for model 1 was 2.
Finally, we compare the two models to see which performed best.
resamp <- resamples(list(rreg_all_vars = lmr_fit1, rreg_high_cor = lmr_fit2)) summary(resamp)
As we can se from the output above, model 1, with all independent variables, have a lower Mean Absolute Error, a lower Root Mean Square Error and a higher Rsquared than model2 in our validation data, so we decide to use model1.
With model 1, lambda 0.25, we get the following results: ```{R, final_model, echo = TRUE} postResample(predict(lmr_fit1, newdata = testing), testing$arr_delay)
### Session info ```r sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.