knitr::opts_chunk$set(echo = TRUE)
The task of this project was to describe the relationships between the total number of bike rentals and variables that appear to be of importance. The results of this analysis show that the hour of the day matters most for how many bikes are rented.
The data used in this analysis is called bikes and is loaded automatically with the bikerentals package. The data set consists of hourly rental data for the first 19 days of each month, spanning two years.
First we need to install and load all the required packages. Start by installing tidyverse, lubridate, randomForest and devtools, if you don't already have them. Then load the devtools package so that you can install the bikerentals package from GitHub. Finally, load all the required packages.
#install.packages("tidyverse")
#install.packages("lubridate")
#install.packages("randomForest")
#install.packages("devtools")
library(devtools)
# Install the bikerentals package from the GitHub repository "Reproduce" of "idahans"
install_github("idahans/Reproduce")
library(lubridate)
library(tidyverse)
library(randomForest)
library(bikerentals)
The bikes data comes with the bikerentals package. Start by looking at the first rows of the data set.
head(bikes)
Look at the structure of the data set
str(bikes)
Set the variables season, holiday, workingday and weather as factors.
bikes$season <- as.factor(bikes$season)
bikes$holiday <- as.factor(bikes$holiday)
bikes$workingday <- as.factor(bikes$workingday)
bikes$weather <- as.factor(bikes$weather)
Look at the summary of the data set and check whether there are any missing values (NAs) in the data.
summary(bikes)
There are no missing values in the data.
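Reading NA counts out of summary() output works, but it is easy to miss one. A quick programmatic check is sketched below on a small toy data frame (the same calls apply unchanged to the real bikes data):

```r
# Toy data frame standing in for bikes; one deliberately missing value.
df <- data.frame(count = c(16, 40, NA), temp = c(9.84, 9.02, 9.84))

sum(is.na(df))      # total number of missing values in the whole data frame
colSums(is.na(df))  # missing values per column
anyNA(df)           # TRUE if any value anywhere is missing
```

On the real data, `anyNA(bikes)` returning FALSE confirms the claim above in one call.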
Here we use the extract_var function from the bikerentals package. It extracts the datetime column into new columns for hour, day and month, and sets these as factors. It keeps only the essential variables in a new data frame called variables. We are not using the information in casual and registered, so these columns are excluded here. Since hour, day and month now carry the datetime information, the datetime column is excluded as well.
variables <- extract_var(bikes)
head(variables)
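We do not show the source of extract_var here, but its effect on the datetime column can be sketched in base R on a toy data frame. This is only an illustration of the transformation, not the package's actual code:

```r
# Toy stand-in for bikes with a POSIXct datetime column.
toy <- data.frame(
  datetime = as.POSIXct(c("2011-01-01 05:00:00", "2011-01-01 06:00:00"), tz = "UTC"),
  season   = factor(c(1, 1)),
  temp     = c(9.84, 9.02)
)

# Derive hour, day and month as factors from the datetime...
toy$hour  <- factor(format(toy$datetime, "%H"))
toy$day   <- factor(format(toy$datetime, "%d"))
toy$month <- factor(format(toy$datetime, "%m"))

# ...then drop the original datetime column, as extract_var does.
toy$datetime <- NULL

str(toy)
```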
Here we use the keep_all function from the bikerentals package. This function does the same as extract_var but also keeps the count column. We use it to make a new data set called bikes_all.
bikes_all <- keep_all(bikes)
head(bikes_all)
Now we can look at the structure of bikes_all and see that we also have hour, day and month as factors.
str(bikes_all)
Now we want to use a Random Forest model to compute the Variable Importance Measure and rank the variables by how important they appear to be.
ranfor <- randomForest(variables, bikes$count, ntree = 100, importance = TRUE)
imp <- importance(ranfor, type = 1)
variableImportance <- data.frame(Variable = row.names(imp), Importance = imp[, 1])
variableImportance
Since Random Forests are stochastic by nature, the results may change slightly from run to run. However, hour will show the highest importance. We can make a plot of this to get a better overview.
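If you want the importance scores to be exactly reproducible across runs, call set.seed() immediately before randomForest(). The mechanism is plain base R and can be demonstrated without the model (the seed value 1 below is arbitrary):

```r
# Seeding the RNG makes random draws repeatable; the same applies to
# randomForest(), which draws bootstrap samples internally.
set.seed(1)              # arbitrary seed; any fixed value works
a <- sample(1:100, 5)

set.seed(1)              # same seed again...
b <- sample(1:100, 5)

identical(a, b)          # TRUE: same seed, same draws
```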
ggplot(variableImportance, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "turquoise") +
  coord_flip() +
  xlab("") +
  ylab("Importance") +
  ggtitle("Random Forest Variable Importance\n") +
  theme_classic()
Since the hour of the day seems to be important for the demand for bike rentals, we can plot the total number of bike rentals against the hour of the day.
bikes_all %>%
  group_by(hour) %>%
  summarize(total = sum(count)) %>%
  ggplot() +
  geom_bar(aes(hour, total), stat = "identity", color = "black", fill = "purple") +
  ggtitle("Total Number of Bike Rentals Across Hours of the Day") +
  xlab("Hour of the day") +
  ylab("Number of total bike rentals") +
  theme_classic()
In this plot we can see the distribution of bike rentals over the hours of the day; most bikes are rented at 8 o'clock in the morning and at 5 and 6 in the evening.
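The peak hours can also be read off numerically rather than from the plot. The base-R sketch below uses a toy data frame with the same hour and count columns; on the real data you would apply the same aggregate() call to bikes_all:

```r
# Toy stand-in for bikes_all with the two columns the summary needs.
toy <- data.frame(
  hour  = factor(c("08", "08", "17", "12")),
  count = c(300, 250, 400, 100)
)

# Total rentals per hour, then rank hours by demand.
totals <- aggregate(count ~ hour, data = toy, FUN = sum)
totals[order(-totals$count), ]   # hours ranked by total rentals, highest first
```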
Fanaee-T, Hadi, and Gama, Joao, Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.
Breiman and Cutler's Random Forests for Classification and Regression https://www.stat.berkeley.edu/~breiman/RandomForests/