In RamiKrispin/coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV) Dataset

knitr::opts_chunk$set(echo = TRUE, 
                      message=FALSE, 
                      warning=FALSE, 
                      fig.height=5, 
                      fig.width=8,
                      collapse = TRUE,
                      comment = "#>")

The coronavirus dataset

The coronavirus dataset provides a daily summary of COVID-19 cases by geographic location (i.e., country/province). It includes total daily confirmed, death, and recovery^1^ cases. Let’s load the dataset from the coronavirus package:

library(coronavirus)

data(coronavirus)

The dataset has the following fields:

date - The date of the summary
province - The province or state, when applicable
country - The country or region name
Lat - Latitude point
Long - Longitude point
type - the type of case (i.e., confirmed, death)
cases - the number of daily cases (corresponding to the case type)
uid - Country code
iso2 - Officially assigned country code identifiers with two-letter
iso3 - Officially assigned country code identifiers with there-letter
code3 - UN country code
combined_key - Country and province (if applicable)
population - Country or province population
continent_name - Continent name
continent_code - Continent code

We can use the head and str functions to see the structure of the dataset:

head(coronavirus)

str(coronavirus)

^1^ Recovery data is discontinued from Aug 5th, please see the following issue for more details.

Querying and analyzing the coronavirus dataset

In the example below, we will use the dplyr and tidyr packages to query, transform, reshape, and keep the data tidy, the plotly package to plot the data, and the DT package to view it:

library(dplyr)
library(tidyr)
library(plotly)
library(DT)

Cases summary

Let's start with summarizing the total number of cases by type as of r max(coronavirus$date) and then plot it:

total_cases <- coronavirus %>%
  filter(type != "recovery") %>%
  group_by(type) %>%
  summarise(cases = sum(cases)) %>%
  mutate(type = factor(type, levels = c("confirmed", "death"))) 

total_cases

Likewise, we can summarise the data by continent using the continent_name field:

coronavirus %>%
  filter(type != "recovery") %>%
  group_by(type, continent_name) %>%
  summarise(cases = sum(cases), .groups = "drop") %>%
  mutate(type = factor(type, levels = c("confirmed", "death"))) %>%
  pivot_wider(names_from = type, values_from = cases) %>%
  mutate(death_rate = death / confirmed) %>%
  filter(!is.na(continent_name)) %>%
  arrange(-death_rate) %>%
  datatable(rownames = FALSE,
            colnames = c("Continent", "Confrimed Cases", "Death Cases","Death Rate %")) %>%
  formatPercentage("death_rate", 2)

You can use those numbers to derive the current worldwide death rate (percentage):

round(100 * total_cases$cases[2] / total_cases$cases[1], 2)

Worldwide aggregation

Let's group the data by the date and case type fields and aggregate the total cases (confirmed and death cases) to the worldwide level and plot the two side by side:

df <- coronavirus %>%
  filter(type != "recovery") %>%
  group_by(date,type) %>%
  summarise(total = sum(cases), .groups = "drop")

p_1 <- plot_ly(data = df %>% filter(type == "confirmed"),
        x = ~ date,
        y = ~ total,
        name = "Confirmed",
        type = "scatter",
        mode = "line") %>%
  layout(yaxis = list(title = "Cases"),
         xaxis = list(title = ""))

p_2 <- plot_ly(data = df %>% filter(type == "death"),
              x = ~ date,
              y = ~ total,
              name = "Death",
              line = list(color = "red"),
              type = "scatter",
              mode = "line") %>%
  layout(yaxis = list(title = "Cases"),
         xaxis = list(title = "Source: Johns Hopkins University Center for Systems Science and Engineering"))

subplot(p_1, p_2, nrows = 2, 
              titleX = TRUE,
              titleY = TRUE) %>%
  layout(title = "Worldwide - Daily Confirmed and Death Cases",
         margin = list(t = 60, b = 60, l = 40, r = 40),
         legend = list(x = 0.05, y = 1)
         )

Top effected countries

The next table provides an overview of the ten countries with the highest confirmed cases. We will use the datatable function from the DT package to view the table:

confirmed_country <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  mutate(perc = total_cases / sum(total_cases)) %>%
  arrange(-total_cases)

confirmed_country %>%
  head(10) %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Cases", "Perc of Total")) %>%
  formatPercentage("perc", 2)

The next plot summarize the distribution of confirmed cases by country:

conf_df <- coronavirus %>% 
  filter(type == "confirmed") %>%
  group_by(country) %>%
  summarise(total_cases = sum(cases)) %>%
  arrange(-total_cases) %>%
  mutate(parents = "Confirmed") %>%
  ungroup() 

  plot_ly(data = conf_df,
          type= "treemap",
          values = ~total_cases,
          labels= ~ country,
          parents=  ~parents,
          domain = list(column=0),
          name = "Confirmed",
          textinfo="label+value+percent parent")

Death rates

Similarly, we can use the pivot_wider function from the tidyr package (in addition to the dplyr functions we used above) to get an overview of the confirmed and death cases and calculate the death rates:

coronavirus %>% 
  filter(country != "Others") %>%
  group_by(country, type) %>%
  summarise(total_cases = sum(cases)) %>%
  pivot_wider(names_from = type, values_from = total_cases) %>%
  arrange(- confirmed) %>%
  mutate(death_rate = death / confirmed)  %>%
  datatable(rownames = FALSE,
            colnames = c("Country", "Confirmed","Death", "Death Rate")) %>%
   formatPercentage("death_rate", 2)