In LUMC/rcourse: R Open Online Course (ROOC)

knitr::opts_chunk$set(comment = NA)

Introduction

You will analyse the storms table which comes with the tidyverse package.
Make sure you put library( tidyverse ) in the R chunk at the top of your R Markdown file as shown here below:

library( tidyverse )

After the library has been loaded you will have access to the table in the storms variable. Each row of storms table is an observation of a storm recorded at a certain moment (date and time) at a geographical location (lat, long). Some additional storm features (wind speed, pressure, ...), classifications (status, category) and a name are also included.

For more details you may consult the help on storms tibble with ?storms but the following column description is sufficient for the SSA:

name: Name of the storm.
year, month, day, hour: Date and time of the observation.
lat, long: Geographical location of the storm centre (numbers).
wind: Wind speed (number, in knots).
pressure: Pressure at the storm's centre (number, in millibars).
tropicalstorm_force_diameter (or ts_diameter in older versions of tidyverse library): Storm diameter (number, in nautical miles).
status: Storm classification (a factor, many levels).
category: Storm category (a number, range: -1..5; many values are missing).

Note, that a single storm is usually observed multiple times (so one storm may be described in multiple rows).

Here is a random part of the table (some columns are omitted):

set.seed(1234L)
bind_rows(
  storms %>% filter( status == "hurricane" ) %>% sample_n( 3L ),
  storms %>% filter( status != "hurricane" ) %>% sample_n( 3L )
) %>% 
  select( name, year, month, lat, long, status, category, wind, pressure ) %>% 
  arrange( year, month )

Questions

Question 1: [4p] Percentage of storms with category at least 4.

Out of all storm measurements with non-missing category value, calculate the percentage of the storm observations that have category at least 4. Find how to use round to round the result to 2 decimal places. Assign the result to the largeCategoryPercentage variable.

# largeCategoryPercentage <- ...

# 1p the condition >= (at least 4) is correct
# 1p the number of observations with category known correct
# 1p the percentage correct
# 1p the rounding correct

largeCategoryPercentage <- 
  ( storms %>% filter( !is.na( category ), category >= 4 ) %>% nrow() ) / 
  ( storms %>% filter( !is.na( category ) ) %>% nrow() ) * 
  100
largeCategoryPercentage <- round( largeCategoryPercentage, 2 )
largeCategoryPercentage

Question 2: [3p] Changing factor levels, counting occurrences.

Take the data from the status column and change the order of levels such that the first three levels are ("tropical storm", "tropical depression", "hurricane") (in exactly this order).
Then, produce a table of counts of the number of observations for each storm status level.
Store the result in statusCounts variable.
Note: Do not modify the original storms table (a changed table may not work in other questions).

# statusCounts <- ...

# 1p some fct levels reordering is done
# 1p the order of levels is correct
# 1p the table of counts is correct

statusCounts <- storms$status %>% 
  fct_relevel( "tropical storm", "tropical depression", "hurricane" ) %>% 
  fct_count()
statusCounts

Question 3: [7p] Table summary in a list.

Create a list with some summaries of the storms table and assign this list to the variable stormsSummary. The list should have the following three elements:

obsNum -- the number of observations in the storms table,
avgWind -- the mean of observed wind speeds (force removal of missing values),
uniqueNames -- a character vector of names from the name column with duplicates removed, sorted in alphabetical order.

# stormsSummary <- ...

# 1p there is a list
# 1p elements in the list have names
# 1p obsNum is correct (nrow)
# 1p mean is calculated 
# 1p NAs are skipped in mean calculation
# 1p names are uniqued
# 1p names are sorted

stormsSummary <- list(
  obsNum = nrow( storms ),
  avgWind = mean( storms$wind, na.rm = TRUE ),
  uniqueNames = storms$name %>% unique() %>% sort()
)
stormsSummary

Question 4: [6p] Dropping summer storms

Create a new tibble stormsNoSummer that contains all observations from storms except those that were made in a summer. Consider 21st of June to be the first day of summer and 22nd of September to be the last day of summer.

# stormsNoSummer <- ...

# 1p any filtering is done
# 1p the filtering is correct for months < 6
# 1p the filtering is correct for months == 6
# 1p the filtering is correct for months == 7,8
# 1p the filtering is correct for months == 9
# 1p the filtering is correct for months > 9

stormsNoSummer <- storms %>% 
  filter( 
    month < 6 | 
    ( month == 6 & day < 21 ) | 
    ( month == 9 & day > 22 ) |
    month > 9
  )
stormsNoSummer

Question 5: [6p] Summarizing storms by month.

Build a tibble reporting the fastest wind and the lowest pressure observed over all years in each month. Report also the total number of observations for each month. During the min/max calculations force omitting possible missing values in the respective columns.
The final table should have four columns: month, fastestWind, lowestPressure, obsNum and it should be sorted in descending order of the number of observations (the most frequent at the top row). Store the result in the variable stormsByMonth.

# stormsByMonth <- ...

# 1p grouping is good
# 1p obsNum is correct
# 1p lowestPressure is correct (NAs removed)
# 1p fastestWind is correct (NAs removed)
# 1p the table is sorted
# 1p the table is sorted in descending order

stormsByMonth <- storms %>% 
  group_by( month ) %>% 
  summarise(
    fastestWind = max( wind, na.rm = TRUE ),
    lowestPressure = min( pressure, na.rm = TRUE ),
    obsNum = n()
  ) %>% 
  arrange( desc( obsNum ) )
stormsByMonth

Question 6. [4p] Cross-tabulation

Create a tibble stormsByStatusAndMonth that contains a cross-tabulation of status and month. The result should be a table with status represented by rows, month in columns, and table values representing the number of observations for each combination of month and status values. Some entries in the crosstable will be NA: check the manual and fill them with zeros.

# stormsByStatusAndMonth <- ...

# 1p counting is correct
# 1p spreading is used
# 1p spreading is correct
# 1p NAs are replaced by zeroes

stormsByStatusAndMonth <- storms %>% 
  count( status, month ) %>% 
  pivot_wider( names_from = month, values_from = n, values_fill = 0L )
  #spread( month, n, fill = 0L )
stormsByStatusAndMonth

Question 7. [9p] Adding wind speed in km/h and its category.

Wind speed in the wind column is given in knots. Create a new column windKPH that expresses wind speed in km/h (1 knot = 1.852 km/h). Then, create a new column windCategory that contains a factor with levels "low", "medium", "high" (exactly in that order). The levels should be determined by the windKPH column values: "low" for windKPH < 75, "medium" for windKPH < 150, and "high" otherwise. The final table should only have columns: name, windCategory and windKPH (exactly in this order). Store the result in the variable stormsWithWindCategory.

# stormsWithWindCategory <- ...

# 1p windKPH column added
# 1p windKPH is correct
# 1p windCategory column added
# 1p windCategory at least one category has correct condition
# 1p windCategory all categories have correct conditions
# 1p windCategory is a factor
# 1p windCategory has correct levels
# 1p the table has correct columns
# 1p the table has correct column order

stormsWithWindCategory <- storms %>% 
  mutate( windKPH = wind * 1.852 ) %>% 
  mutate( windCategory = case_when(
    windKPH < 75 ~ "low",
    windKPH < 150 ~ "medium",
    TRUE ~ "high"
  ) %>% factor( levels = c( "low", "medium", "high" ) ) ) %>% 
  select( name, windCategory, windKPH )
stormsWithWindCategory

Question 8: [7p] A box plot.

Based on the storms tibble create a box plot:

The vertical axis should represent pressure.
The horizontal axis: in aes(...) instead of wind use factor(wind) (to make wind a categorical variable).
Use gray box fill and blue colour.
Adjust the vertical title to "Pressure [millibars]" and horizontal to "Wind speed [knots]".
Use the black/white theme.

# ggplot( ... ) + ...

# 1p geom_boxplot is created
# 1p the vertical axis is correct
# 1p the horizontal axis is correct
# 1p both titles are correct
# 1p the theme is correct
# 1p the colour is set
# 1p the fill is set

ggplot( storms ) + 
  aes( x = factor( wind ), y = pressure ) +
  geom_boxplot( fill = "gray", color = "blue" ) + 
  labs( y = "Pressure [millibars]", x = "Wind speed [knots]" ) +
  theme_bw()

Question 9: [8p] Scatter plot

For this scatter plot take from storms only the rows with a missing tropicalstorm_force_diameter (or ts_diameter) value. Use long for the horizontal axis and lat for the vertical. Use transparency level of 0.5 and point size of 0.75. Colour points according to wind. Finally, use the colour scale with green for low and red for high wind values.

# ggplot( ... ) + ...

# 1p geom_point is used
# 1p rows with missing ts_diameter are selected
# 1p the horizontal axis is correct 
# 1p the vertical axis is correct
# 1p the transparency is set
# 1p the point size is set
# 1p wind is used for colour in aes
# 1p the colour scale is set

filteredStorms <- storms %>% filter( is.na( tropicalstorm_force_diameter ) )
ggplot( filteredStorms ) + 
  aes( x = long, y = lat, color = wind ) +
  geom_point( alpha = 0.5, size = 0.75 ) +
  scale_color_gradient( low = "green", high = "red" )