In henkar91/AdvRprogr_lab7: Linear regression analysis

knitr::opts_chunk$set(echo = TRUE)

We start by loading the packages we need for performing the analysis, here I've chosen to work the nycflights13, tidyverse and caret. nycflights13 data set, contain different 5 different data sets for air traffic from NYC during 2013 which is the data we will work with, tidyverse is package loading several different useful packages that are related to tidyr and dplyr and last but not least caret, which we use for building the prediction models.

Also, we set the seed, which tells us where the random iteration starts, so we can reproduce our results. If we wouldn't set seed, the script would generate different results everytime we run the script.

# Packages ----------------------------------------------------------------

library(nycflights13)
library(caret)
library(tidyverse)
library(linreg)

set.seed(1991)

# Read data ---------------------------------------------------------------

flights <- nycflights13::flights
weather <- nycflights13::weather

Data processing

We start off the analysis by merging our two different data sets into one, but before merging, we remove some time variables that are present in both files to avoid duplicates in the merged file.

# Removed duplicated variables time stamps from weather (variables exists in bith weather and flights data)
weather <- weather %>%
    dplyr::select(-year, -month, -day, -hour)

# Join flights and weather data
df <- left_join(flights, weather, by = c("origin", "time_hour"))

When analysing the merged data frame we can see that there are several flights that are missing weather data. When we develop our prediction model, we want our data set to be tidy (one variable in each column, one observation in each row and no missing values). So to make sure there aren't just one airport or month that have all missing data we compute the weather_NA data frame.

# Flight missing weather information is distributed on all airports and months
weather_NA <- df %>%
    filter(is.na(temp)) %>%
    group_by(origin, month) %>%
    summarise(missing = n()) %>%
    spread(month, missing)

weather_NA

Since all airport have missing weather data and most months over the year (even if it more common in winter), we decide to remove all flights with missing weather data.

After a quick summary, we can see that there are a lot of variables that misses a few observations so we decide to remove these from the analysis as well (Output have been hidden in vignette).

lapply(df, FUN = function(x){summary(x)})

Finally we make a random subset of the data to reduce the volume, just for increasing calculation speed (this is something we only do for this exercise.)

# Delete all cases with missing value
df <- df[complete.cases(df), ] %>%
    dplyr::select(-year)


# Reducing the amount of data used for analysis
df <- df %>%
    sample_frac(.01)

Variable selection

We need to decide what covariates we want to include in the model. To get a hunch of what variables that could have "predictive power" for flight delays, we make compute the correlations between all variables.

We construct one model with all variables which have an absolute correlation above 0.15.

We also make some additional data cleaning, removing all variabels we believe won't improve the model.

# Select variables with predictive value
cor_matrix <- cor(df[sapply(df, is.numeric)])

has_effect <- cor_matrix[cor_matrix[,"arr_delay"] >= abs(0.15) ,"arr_delay"]

formula_short <- arr_delay ~ dep_delay + dep_time + hour


# Remove variables without effect
df <- df %>%
    dplyr::select(-c(sched_dep_time, carrier, flight, tailnum, dest,
              minute, time_hour, wind_gust))

Now when the data is prepared, we can start building our prediction model.

Partition data

First we partition the data set into 3 different groups, training, validation and testing.

Training data set is used to train the model for finding the optimal parameters. This data set contain 80% of the observations.
Validating data set is used to validate the result of our different models we develop. This data set contains 15 % of the observations.
Test data set is used for our final evaluation of our best prediction model. This data set contains 5% of the observations.

By splitting up the data set in different partitions, reduces the risk of overfitting the model, due to the fact that the data we validate haven't been used in the training process of our algorithm. The reason for splitting the testing and validating data is to make sure we don't tweak our model too much so it actually fit the validation set "better" than the real population of data. The test data set is only used once when you have developed a model you're satisfied with.

We also specify the resampling method. Here we uses 10-fold crossvalidation of the data repeated 3 times.

# Create test and training data -------------------------------------------

set.seed(1991)
# Create train and test data
in_train <- createDataPartition(y = df$arr_delay, 
                                p = .8, 
                                list = FALSE)

training <- df[in_train, ]
tmp <- df[-in_train, ]

# create validate and test data
in_validate <- createDataPartition(y = tmp$arr_delay, p = .75, list = FALSE)

validate <- tmp[in_validate, ]
testing <- tmp[-in_validate, ]

rm(tmp)


# Set resampling method ---------------------------------------------------

ctrl <- trainControl(method = "repeatedcv", 
                     repeats = 3)

Implementing self developed ridge regression in Caret package

Ridgereg_model <- list(
    type = c("Regression"),
    library = "linreg",
    loop = NULL,
    prob = NULL)

Ridgereg_model$parameters <- data.frame(parameter = "lambda",
                                        class = "numeric",
                                        label = "lambda")

Ridgereg_model$grid <- function (x, y, len = NULL, search = "grid"){
    data.frame(lambda = seq(0,2,by=0.25))
}

Ridgereg_model$fit <- function (x, y, wts, param, lev, last, classProbs, ...){
    dat <- if (is.data.frame(x))
        x
    else
        as.data.frame(x)
    dat$.outcome <- y

    output <-
        ridgereg$new(.outcome ~ ., data = dat, lambda = param$lambda, ...)

    output
}


Ridgereg_model$predict <- function(modelFit, newdata, preProc = NULL, submodels = NULL){
    if (!is.data.frame(newdata))
        newdata <- as.data.frame(newdata)
    modelFit$predict(newdata)
}

Model 1

The first model we build contains all variables in our data set. Since ridge regression penalizes for a lot of covariates, we can include all variables without overfitting the model.

# Model 1

# Train model
start <- Sys.time()
lmr_fit1 <- train(arr_delay ~ .,
                 data = training,
                 method = Ridgereg_model,
                 trControl = ctrl)
end <- Sys.time()
end-start
# Full data 15 min to run, lambda = 0, Rsquared = 87.124% MAE 10.87018

# Predict
lmr_class1 <- predict(lmr_fit1, newdata = validate)

# Evaluate
lmr_fit1

df_lmr_fit1 <- validate %>%
    dplyr::select(arr_delay) %>%
    cbind(., fitted_values = predict(lmr_fit1, newdata = validate)) %>%
    mutate(resid = arr_delay - fitted_values)


gg1_lmr_fit1 <- ggplot(data = df_lmr_fit1, aes(x = arr_delay, y = fitted_values)) +
    geom_point(aes(alpha = .01)) +
    geom_abline(slope = 1, intercept = 0, colour = "red", size = 0.5) +
    ggtitle("Correct vs fitted values") +
    theme(legend.position="none")

gg2_lmr_fit1 <- ggplot(data = df_lmr_fit1, aes(x = fitted_values, y = resid)) +
    geom_point(aes(alpha = .01)) +
    geom_smooth(method = "loess", se = FALSE, colour = "red", size = .5) +
    ggtitle("Residuals vs Fitted values") +
    theme(legend.position="none")

gg1_lmr_fit1
gg2_lmr_fit1

postResample(lmr_class1, validate$arr_delay)

The optimal lambda for model 1 was 0.25.

Model 2

Model 2 we've used all variables that have a correlation above absolute 0.15.

# Model 2
lmr_fit2 <- train(formula_short,
                  data = training,
                  method = Ridgereg_model,
                  trControl = ctrl,
                  preProc = "scale")

# Predict
lmr_class2 <- predict(lmr_fit2, newdata = validate)

# Evaluted
lmr_fit2

df_lmr_fit2 <- validate %>%
    dplyr::select(arr_delay) %>%
    cbind(., fitted_values = predict(lmr_fit2, newdata = validate)) %>%
    mutate(resid = arr_delay - fitted_values)


gg1_lmr_fit2 <- ggplot(data = df_lmr_fit2, aes(x = arr_delay, y = fitted_values)) +
    geom_point(aes(alpha = .01)) +
    geom_abline(slope = 1, intercept = 0, colour = "red", size = 0.5) +
    theme(legend.position="none")

gg2_lmr_fit2 <- ggplot(data = df_lmr_fit2, aes(x = fitted_values, y = resid)) +
    geom_point(aes(alpha = .01)) +
    geom_smooth(method = "loess", se = FALSE, colour = "red", size = .5) +
    theme(legend.position="none")


gg1_lmr_fit2
gg2_lmr_fit2
postResample(lmr_class2, validate$arr_delay)

The optimal lambda for model 1 was 2.

Model comparision

Finally, we compare the two models to see which performed best.

resamp <- resamples(list(rreg_all_vars = lmr_fit1,
                         rreg_high_cor = lmr_fit2))

summary(resamp)

As we can se from the output above, model 1, with all independent variables, have a lower Mean Absolute Error, a lower Root Mean Square Error and a higher Rsquared than model2 in our validation data, so we decide to use model1.

Result of final model

With model 1, lambda 0.25, we get the following results: ```{R, final_model, echo = TRUE} postResample(predict(lmr_fit1, newdata = testing), testing$arr_delay)