In this file we try to predict the arrival delay of each flight using the ridgereg() function from the package lab7ab.
library(devtools)
library(caret)
library(nycflights13)
library(dplyr)
#install_github("brbatv/lab7",force=TRUE)
library(lab7ab)
Exploring the datasets from the nycflights13 library.
The dataset weather has the following variables:
colnames(weather)
The dataset flights has the following variables:
colnames(flights)
The two datasets share several variables (year, month, day, origin, hour, time_hour), so a new dataset is created by joining the weather information onto the flights dataset.
flights_weather <- left_join(flights, weather,by = c("year", "month", "day", "origin", "hour", "time_hour"))
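To make the behaviour of the join concrete, here is a minimal sketch on two toy data frames. The data frames toy_flights and toy_weather and their values are made up for illustration; only left_join() from dplyr is taken from the text above. A flight with no matching weather observation is kept, and its weather columns are filled with NA, which is why missing values have to be handled later.

```r
library(dplyr)

# Toy stand-ins for flights and weather (hypothetical values, illustration only)
toy_flights <- data.frame(
  origin = c("JFK", "JFK", "LGA"),
  hour = c(5, 6, 5),
  dep_delay = c(2, -1, 10)
)
toy_weather <- data.frame(
  origin = c("JFK", "LGA"),
  hour = c(5, 5),
  temp = c(39, 41)
)

# Keep every flight; weather columns are NA where no weather row matches the keys
toy_joined <- left_join(toy_flights, toy_weather, by = c("origin", "hour"))
print(toy_joined)
```

In this sketch the JFK flight at hour 6 has no weather record, so its temp is NA while the other two rows receive a temperature.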
The weather variables are of interest as predictors, so the database used for the prediction is created by selecting them together with the scheduling and delay columns:
database <- select(flights_weather,month,day, sched_dep_time,sched_arr_time, dep_delay, arr_delay,temp,dewp,wind_dir, wind_speed,wind_gust,precip,pressure,visib)
To use the caret package, all rows with missing values need to be removed from the database.
The data are then split into three new datasets:
\itemize{
\item train 80% of the observations
\item test 5% of the observations
\item validation 15% of the observations
}
library(caret)
set.seed(3456)
database <- na.omit(database)
trainIndex = createDataPartition(database$arr_delay, p=0.8, list=FALSE, times=1)
train = database[trainIndex,]
notrain = database[-trainIndex,]
validationIndex = createDataPartition(notrain$arr_delay, p=0.75, list=FALSE, times=1)
#validationIndex refers to rows of notrain, so notrain (not database) is indexed here
validation = notrain[validationIndex,]
test = notrain[-validationIndex,]
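The two-step split can be checked on toy data. The sketch below is an assumption-laden simplification: it uses plain random sampling with base R's sample() instead of caret's createDataPartition(), and the data frame toy is made up. It shows that taking 75% of the remaining 20% yields 15% of all rows for validation and 5% for test, and that the three subsets do not overlap as long as the second index is applied to the held-out rows.

```r
set.seed(3456)

# Hypothetical toy data standing in for the cleaned flight database
n <- 1000
toy <- data.frame(id = seq_len(n), arr_delay = rnorm(n))

# Step 1: 80% of all rows go to training
train_idx <- sample(n, size = round(0.8 * n))
toy_train <- toy[train_idx, ]
toy_rest <- toy[-train_idx, ]

# Step 2: 75% of the held-out 20% (15% overall) go to validation
val_idx <- sample(nrow(toy_rest), size = round(0.75 * nrow(toy_rest)))
toy_validation <- toy_rest[val_idx, ]
toy_test <- toy_rest[-val_idx, ]

# Share of rows in each subset
sizes <- c(train = nrow(toy_train),
           validation = nrow(toy_validation),
           test = nrow(toy_test))
print(sizes / n)

# Count rows shared between any two subsets (should be zero)
overlap <- length(intersect(toy_train$id, toy_validation$id)) +
  length(intersect(toy_train$id, toy_test$id)) +
  length(intersect(toy_validation$id, toy_test$id))
print(overlap)
```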
After training on the train dataset and comparing the RMSE on the validation dataset, the best value obtained for lambda is $10^{8}$.
#root mean squared error
rmse <- function(error){
  sqrt(mean(error^2))
}
#formula for ridge regression
formula = arr_delay ~ month + day + sched_dep_time + sched_arr_time + dep_delay + temp + dewp + wind_dir +
  wind_speed + wind_gust + precip + pressure + visib
#training ridge regression for a given value of lambda (here the selected 10^8)
mod <- ridgereg$new(formula=formula, data=train, lambda=10^8)
Y_validation <- validation$arr_delay
Y_predicted <- mod$predict(newdata=validation)
#calculating the error on the validation set
error = Y_predicted - Y_validation
RMSE_check <- rmse(error)
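The chunk above evaluates a single lambda. A search over several values could look like the sketch below. It does not use ridgereg: the functions ridge_fit() and ridge_predict() are hypothetical stand-ins that implement the textbook closed-form ridge estimate on standardised predictors, and the data are simulated, so the lambda it selects says nothing about the flights data.

```r
set.seed(1)

# Simulated data (hypothetical, not the flights data)
n <- 200
X <- matrix(rnorm(n * 3), ncol = 3)
y <- drop(X %*% c(2, -1, 0.5) + rnorm(n))
train_rows <- 1:150
val_rows <- 151:200

# Closed-form ridge fit on standardised predictors (stand-in for ridgereg)
ridge_fit <- function(X, y, lambda) {
  centre <- colMeans(X)
  scale_sd <- apply(X, 2, sd)
  Xs <- scale(X, center = centre, scale = scale_sd)
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)),
                crossprod(Xs, y - mean(y)))
  list(beta = drop(beta), intercept = mean(y),
       centre = centre, scale_sd = scale_sd)
}

# Predict with the training centre and scale applied to new data
ridge_predict <- function(fit, X) {
  Xs <- scale(X, center = fit$centre, scale = fit$scale_sd)
  drop(fit$intercept + Xs %*% fit$beta)
}

rmse <- function(error) sqrt(mean(error^2))

# Fit on the training rows and score on the validation rows for each lambda
lambdas <- 10^(-2:8)
val_rmse <- sapply(lambdas, function(l) {
  fit <- ridge_fit(X[train_rows, ], y[train_rows], l)
  rmse(ridge_predict(fit, X[val_rows, ]) - y[val_rows])
})

# Lambda with the smallest validation RMSE in this simulated example
best_lambda_sketch <- lambdas[which.min(val_rmse)]
print(data.frame(lambda = lambdas, rmse = val_rmse))
print(best_lambda_sketch)
```

With the real data, the same loop would call ridgereg$new() and mod$predict() on train and validation in place of the stand-in functions, keeping the test set untouched until a lambda has been chosen.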
With the best lambda found, the model is refit on the train dataset and used to predict the test dataset. The RMSE of the predictions on the test dataset is:
best_lambda = 10^8
mod <- ridgereg$new(formula=formula, data=train, lambda=best_lambda)
Y_test <- test$arr_delay
Y_predicted2 <- mod$predict(newdata=test)
error2 = Y_predicted2 - Y_test
RMSE_best_lambda <- rmse(error2)
RMSE_best_lambda