knitr::opts_chunk$set(echo = TRUE, tidy = TRUE, cache = 1, tidy.opts=list(blank=FALSE, width.cutoff=60))
For Tasks 1--4, use the following data:
library(tidyverse)
mpg %>% tbl_df
Aesthetics: x mapped to displ, y mapped to hwy, colour mapped to trans. Geom: point.
ggplot(mpg, aes(x = displ, y = hwy, colour = trans)) + geom_point()
Aesthetics: x $\rightarrow$ displ, y $\rightarrow$ hwy. Geoms: point (with colour set to 'red'), smooth.
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(colour = 'red') + geom_smooth()
Aesthetics: x $\rightarrow$ displ, y $\rightarrow$ hwy, colour $\rightarrow$ drv. Geoms: point, smooth. For the smooth geom, use method='lm' to get linear (rather than LOESS) regression, and se=FALSE to suppress confidence bands.
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + geom_smooth(method = "lm", se = FALSE)
Aesthetics: x $\rightarrow$ displ, y $\rightarrow$ hwy, colour $\rightarrow$ cyl (as discrete values). Geoms: point, smooth.
# wrong: cyl is numeric, so colour is mapped on a continuous scale
ggplot(mpg, aes(x = displ, y = hwy, colour = cyl)) + geom_point() + geom_smooth()
# right: factor(cyl) makes the colour mapping discrete
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) + geom_point() + geom_smooth()
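The difference between the two calls comes down to the type of the colour variable. As a rough illustration in base R, using a made-up stand-in vector rather than the real mpg column (the values are assumptions for this example), factor() turns numeric values into a small set of discrete levels:

```r
# Stand-in for the cyl column: numeric cylinder counts (illustrative values only)
cyl <- c(4, 4, 6, 8, 6, 4)

is.numeric(cyl)        # numeric, so ggplot2 would use a continuous colour scale

cyl_f <- factor(cyl)   # convert to a factor with one level per distinct value
is.factor(cyl_f)       # a factor, so ggplot2 would use a discrete colour scale
levels(cyl_f)          # the distinct categories, as character labels
nlevels(cyl_f)         # how many categories (and legend entries) there are
```

Wrapping the variable in factor() inside aes() applies the same conversion on the fly, without changing the data frame.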
For Task 5, use the following data:
library(tidyverse)
mpg %>% tbl_df
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) + geom_point() + geom_smooth(method = "lm", se = FALSE)
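Modify the plot above so that the x and y axes are on log scales, with the axis labels Displacement and MPG, highway; the colour legend is titled Cylinders; the plot title is Fuel economy and engine size; and the plot is faceted by year, with the panels labelled Model year 1999 and Model year 2008.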
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(breaks = 2:7) +
  scale_y_log10(breaks = c(20, 30, 40)) +
  facet_wrap(~ year, labeller = as_labeller(c('1999' = 'Model year 1999', '2008' = 'Model year 2008'))) +
  labs(x = 'Displacement', y = 'MPG, highway', colour = 'Cylinders', title = 'Fuel economy and engine size')
For tasks 6--11, use the following data:
library(tidyverse)
data(mpg, package = 'ggplot2')
mpg %>% tbl_df
Use the select verb to extract the following columns from mpg: manufacturer, model, displ, year, cyl, trans, cty, hwy, and view the output, which should look something like this:
(mpg2 <- mpg %>% select(`manufacturer`, `model`, `displ`, `year`, `cyl`, `trans`, `cty`, `hwy`))
Create a new data frame called mpg2 that contains this subset of columns, i.e.:
mpg2 <- mpg %>% [ your code from the previous step ]
Working with mpg2, use the mutate verb to create two new columns: displ2, which is equal to the square of displacement (i.e., displ * displ or displ^2), and vol_per_cyl, which is equal to displ divided by cyl (i.e., displ / cyl), rounded to the nearest 1/100th of a liter (i.e., two decimal places).
(mpg3 <- mpg2 %>% mutate(displ2 = displ^2, vol_per_cyl = round(displ / cyl, 2)))
Create a new data frame called mpg3 that contains this new data frame with the two extra columns:
mpg3 <- mpg2 %>% [ your code from the previous step ]
Working with mpg3, use the arrange verb to re-order the rows in descending order of vol_per_cyl:
mpg3 %>% arrange(desc(vol_per_cyl))
Working with mpg3, use the filter verb to extract the subset of rows corresponding to the manufacturer chevrolet, then use arrange to place these rows in descending order by vol_per_cyl:
mpg3 %>% filter(manufacturer == 'chevrolet') %>% arrange(desc(vol_per_cyl))
Working with mpg3, use the group_by verb to group this data frame by manufacturer and year. Then, for each unique pair of manufacturer and year, calculate the largest value of vol_per_cyl. Call this new column max_vol_per_cyl.
(mpg4 <- mpg3 %>% group_by(manufacturer, year) %>% summarise(max_vol_per_cyl = max(vol_per_cyl)))
Create a new data frame called mpg4 that contains this new data frame:
mpg4 <- mpg3 %>% [ your code from the previous step ]
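To see what the grouped summary does on something small enough to check by hand, here is a minimal sketch on a made-up data frame (the manufacturers "a" and "b" and all values are hypothetical, not taken from mpg). The `.groups = "drop"` argument is my addition and assumes dplyr 1.0 or later; it only removes the leftover grouping from the result.

```r
library(dplyr)

# Hypothetical toy data, not taken from mpg
toy <- tibble(
  manufacturer = c("a", "a", "a", "b", "b"),
  year         = c(1999, 1999, 2008, 1999, 2008),
  vol_per_cyl  = c(0.45, 0.50, 0.62, 0.70, 0.55)
)

# One output row per (manufacturer, year) pair, holding the largest vol_per_cyl
toy_max <- toy %>%
  group_by(manufacturer, year) %>%
  summarise(max_vol_per_cyl = max(vol_per_cyl), .groups = "drop")

toy_max
```

The two 1999 rows for manufacturer "a" collapse into a single row, so five input rows become four output rows.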
Working with mpg4, use the spread verb from the tidyr package to create a data frame that looks like this:
(mpg5 <- mpg4 %>% spread(year, max_vol_per_cyl))
Hint: the key/value pair in mpg4 is made up of year (key) and max_vol_per_cyl (value).
Create a new data frame called mpg5 that contains this new data frame:
mpg5 <- mpg4 %>% [ your code from the previous step ]
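The key/value idea may be easier to see on a tiny example. The sketch below uses a made-up long data frame (all names and values are hypothetical) and spreads the year key into one column per year. For comparison it also shows pivot_wider(), which tidyr 1.0 and later offer as the newer interface for the same reshaping; the tasks here still ask for spread.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: one row per manufacturer/year pair
toy_long <- data.frame(
  manufacturer    = c("a", "a", "b", "b"),
  year            = c(1999, 2008, 1999, 2008),
  max_vol_per_cyl = c(0.50, 0.62, 0.70, 0.55)
)

# spread(): each distinct value of the key (year) becomes a column name,
# and the cells are filled from the value column (max_vol_per_cyl)
toy_wide <- toy_long %>% spread(year, max_vol_per_cyl)
names(toy_wide)

# The same reshaping with the newer interface (assumes tidyr >= 1.0)
toy_wide2 <- toy_long %>%
  pivot_wider(names_from = year, values_from = max_vol_per_cyl)
names(toy_wide2)
```

Either way, the result has one row per manufacturer and columns named 1999 and 2008.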
Working with mpg5, create a new column representing the difference between the values of the columns 2008 and 1999. Call this variable change.
Note: Refer to the column names using the backtick character (`) so that R knows you mean 2008 the column name, and not 2008 the number. That is, you should write code referring to the column as `2008` and not 2008.
(mpg6 <- mpg5 %>% mutate( change = `2008` - `1999` ))
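To make the backtick point concrete, here is a small base R sketch with a made-up data frame (the values are hypothetical). `check.names = FALSE` is needed only so that data.frame() keeps the numeric-looking column names as they are.

```r
# Hypothetical data frame whose column names look like numbers
df <- data.frame(`1999` = c(0.50, 0.70),
                 `2008` = c(0.62, 0.55),
                 check.names = FALSE)

# With backticks, R looks up the columns named 2008 and 1999
df$change <- df$`2008` - df$`1999`

# Without backticks, R just subtracts the two numbers
not_columns <- 2008 - 1999

df
not_columns
```

The first subtraction works row by row on the columns, while the second is ordinary arithmetic on two constants.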
Create a new data frame called mpg6 that contains this new data frame:
mpg6 <- mpg5 %>% [ your code from the previous step ]
Working with mpg6, rename the columns 1999 and 2008 to be max_vpc_1999 and max_vpc_2008. Then use the gather verb to create a key/value pair based on the three numeric columns, with the key column called variable and the value column called value. (Hint: add %>% View at the end of your code to see all of the data in tabular form via the RStudio GUI.)
mpg6 %>% rename(max_vpc_1999 = `1999`, max_vpc_2008 = `2008`) %>% gather(variable, value, -manufacturer) %>% as.data.frame
The remaining tasks use data from the nycflights13 package, which can be installed with:
install.packages('nycflights13')
library(tidyverse)
library(nycflights13)
flights %>% tbl_df
airlines %>% tbl_df
weather %>% tbl_df
Use select to create smaller versions of the flights and weather data frames, named flights2 and weather2.
flights2 <- flights %>% select(origin, year, month, day, hour, sched_dep_time, dep_delay, carrier)
weather2 <- weather %>% select(origin, year, month, day, hour, precip, wind_speed, visib)
Perform an inner join between flights and airlines. The output should look like this:
flights2 %>% inner_join(airlines)
Perform a join between flights2 and weather2 that includes all rows from flights2. Your output should look like this:
flights2 %>% left_join(weather2)
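The contrast between the two kinds of join is easiest to see on a few rows. The following is a minimal sketch on made-up tables (toy_flights and toy_weather are hypothetical stand-ins, not the real nycflights13 data), joined on explicitly named key columns:

```r
library(dplyr)

# Hypothetical stand-ins for flights2 and weather2
toy_flights <- tibble(
  origin    = c("JFK", "JFK", "LGA"),
  hour      = c(5, 6, 5),
  dep_delay = c(2, -1, 10)
)
toy_weather <- tibble(
  origin = c("JFK", "LGA"),
  hour   = c(5, 5),
  precip = c(0, 0.1)
)

# inner_join keeps only the flights that have a matching weather row
inner <- toy_flights %>% inner_join(toy_weather, by = c("origin", "hour"))

# left_join keeps every flight; flights without a match get NA for precip
left <- toy_flights %>% left_join(toy_weather, by = c("origin", "hour"))

nrow(inner)
nrow(left)
is.na(left$precip)
```

Here the JFK flight at hour 6 has no weather record, so it is dropped by the inner join and kept with a missing precip value by the left join.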
Why are there NA values under precip, wind_speed, and visib?
The code below finds the smallest values of precip and wind_speed and the largest value of visib in weather2. The argument na.rm=TRUE allows us to ignore any weather data where one of these variables might be missing (i.e., NA). You can execute this code or just look at mine:
weather2 %>%
  summarise(min_precip = min(precip, na.rm = TRUE),
            min_wind = min(wind_speed, na.rm = TRUE),
            max_visib = max(visib, na.rm = TRUE))
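As a quick illustration of what na.rm changes, here is a base R sketch on a made-up vector (the values are hypothetical, not taken from weather2):

```r
# Hypothetical precipitation readings with one missing value
precip <- c(0, 0.02, NA, 0.1)

min_with_na    <- min(precip)                # a single NA makes the result NA
min_without_na <- min(precip, na.rm = TRUE)  # missing values are dropped first

min_with_na
min_without_na
```

The same applies to max(), mean(), and the other summary functions used in these tasks.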
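Working with flights2 joined with weather2, create a new data frame called good_weather_delays that keeps only the flights that departed under ideal weather (precip equal to 0, wind_speed equal to 0, and visib equal to 10). This takes a call to inner_join and another call to filter. Use the %>% operator to chain all of this together. good_weather_delays should look like this:
(good_weather_delays <- flights2 %>%
  inner_join(weather2, by = c("origin", "year", "month", "day", "hour")) %>%
  filter(precip == 0 & wind_speed == 0 & visib == 10))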
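Working with this data frame (good_weather_delays), compute the average departure delay (dep_delay) for each carrier, arrange the carriers in descending order of average delay, and join the result with the airlines data frame to get the full name of the airlines:
(avg_good_weather_delays <- good_weather_delays %>%
  group_by(carrier) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(dep_delay)) %>%
  inner_join(airlines, by = "carrier"))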
Create a new data frame called avg_good_weather_delays that contains this result:
avg_good_weather_delays <- [ your code from the previous step ]
# Levels are reversed so that, after coord_flip(), the airline with the largest average delay is drawn at the top
ranked_airline_labels <- avg_good_weather_delays %>%
  transmute(carrier, name = factor(name, levels = rev(name)))
good_weather_delays %>%
  inner_join(ranked_airline_labels) %>%
  ggplot(aes(x = name, y = dep_delay)) +
  stat_summary() +
  coord_flip() +
  labs(x = '', y = 'Average departure delay', title = 'Departure delays under ideal weather conditions\nNYC airports, 2013')
To create this plot, I used avg_good_weather_delays to create an ordered factor based on the name of the airlines. The ordered factor causes ggplot to arrange the airlines in descending order based on their average values. This is the code I wrote to do this:
# Each carrier keeps its own name; only the order of the factor levels changes
ranked_airline_labels <- avg_good_weather_delays %>%
  transmute(carrier, name = factor(name, levels = rev(name)))
I joined this data frame (ranked_airline_labels) with the good_weather_delays data frame, then passed the result into ggplot2. I mapped name to the x (not y) axis, and mapped dep_delay to the y axis. This is reverse from what you see, but I did it because stat_summary, which creates the points with error bars, expects to summarize over y.
I added coord_flip() to my ggplot call, which swaps the x and y axes.
library(tidyverse)
library(nycflights13)

## Task 1
ggplot(mpg, aes(x = displ, y = hwy, colour = trans)) + geom_point()

## Task 2
ggplot(mpg, aes(x = displ, y = hwy)) + geom_point(colour = 'red') + geom_smooth()

## Task 3
ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point() + geom_smooth(method = "lm", se = FALSE)

## Task 4
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) + geom_point() + geom_smooth()

## Task 5
ggplot(mpg, aes(x = displ, y = hwy, colour = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_log10(breaks = 2:7) +
  scale_y_log10(breaks = c(20, 30, 40)) +
  facet_wrap(~ year, labeller = as_labeller(c('1999' = 'Model year 1999', '2008' = 'Model year 2008'))) +
  labs(x = 'Displacement', y = 'MPG, highway', colour = 'Cylinders', title = 'Fuel economy and engine size')

## Task 6
mpg2 <- mpg %>% select(manufacturer, model, displ, year, cyl, trans, cty, hwy)

## Task 7
mpg3 <- mpg2 %>% mutate(displ2 = displ^2, vol_per_cyl = round(displ / cyl, 2))

## Task 8
mpg3 %>% arrange(desc(vol_per_cyl))
mpg3 %>% filter(manufacturer == 'chevrolet') %>% arrange(desc(vol_per_cyl))
mpg4 <- mpg3 %>% group_by(manufacturer, year) %>% summarise(max_vol_per_cyl = max(vol_per_cyl))

## Task 9
mpg5 <- mpg4 %>% spread(year, max_vol_per_cyl)

## Task 10
mpg6 <- mpg5 %>% mutate(change = `2008` - `1999`)

## Task 11
mpg6 %>% rename(max_vpc_1999 = `1999`, max_vpc_2008 = `2008`) %>% gather(variable, value, -manufacturer) %>% as.data.frame

## Task 12
flights2 <- flights %>% select(origin, year, month, day, hour, sched_dep_time, dep_delay, carrier)
weather2 <- weather %>% select(origin, year, month, day, hour, precip, wind_speed, visib)
flights2 %>% inner_join(airlines)
flights2 %>% left_join(weather2)

## Task 13
weather2 %>%
  summarise(min_precip = min(precip, na.rm = TRUE),
            min_wind = min(wind_speed, na.rm = TRUE),
            max_visib = max(visib, na.rm = TRUE))
good_weather_delays <- flights2 %>%
  inner_join(weather2) %>%
  filter(precip == 0 & wind_speed == 0 & visib == 10)
avg_good_weather_delays <- good_weather_delays %>%
  group_by(carrier) %>%
  summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(dep_delay)) %>%
  inner_join(airlines)

## Task 14
# Levels are reversed so that, after coord_flip(), the largest average delay is drawn at the top
ranked_airline_labels <- avg_good_weather_delays %>%
  transmute(carrier, name = factor(name, levels = rev(name)))
good_weather_delays %>%
  inner_join(ranked_airline_labels) %>%
  ggplot(aes(x = name, y = dep_delay)) +
  stat_summary() +
  coord_flip() +
  labs(x = '', y = 'Average departure delay', title = 'Departure delays under ideal weather conditions\nNYC airports, 2013')