In RamiKrispin/coronavirus: The 2019 Novel Coronavirus COVID-19 (2019-nCoV) Dataset

knitr::opts_chunk$set(
  fig.height=5, fig.width=8, message=FALSE, warning=FALSE,
  collapse = TRUE,
  comment = "#>"
)

This vignette provides an overview to the covid19_vaccine dataset.

Data

Before getting started, please note the following:

It may be challenging to compare countries as each may use different count methods for the given doses.
In addition, the dose summary includes multiple types of vaccines, and there is no information if the doses are a single or multiple doses series.

Data sources:

Vaccine data - https://github.com/govex/COVID-19/tree/master/data_tables/vaccine_data/global_data
Country/province codes and population - https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv
Continent names and code - DATA HUB, https://datahub.io/JohnSnowLabs/

Load the data

Let's start by loading the packages we will use in this vignette:

library(coronavirus)
library(dplyr)
library(plotly)

We load the dataset from the coronavirus package, and use the dplyr and plotly packages to manipulate and plot the data. Let's load the data:

data("covid19_vaccine")

head(covid19_vaccine)

The covid19_vaccine dataset provides time-series data on the vaccination progress by country or province (if applicable). Likewise, the coronavirus dataset, the COVID19 vaccine raw data is collected by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). The covid19_vaccine data includes the following fields:

date - Data collection date in YYYY-MM-DD format
country_region - Country or region name
continent_name - Continent name
continent_code - Continent code
combined_key - Country and province (if applicable)
doses_admin - Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independently
people_at_least_one_dose - Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twice
population - Country or province population
uid - Country code
iso2 - Officially assigned country code identifiers with two-letter
iso3 - Officially assigned country code identifiers with three-letter
code3 - UN country code
fips - Federal Information Processing Standards code that uniquely identifies counties within the USA
lat - Latitude
long - Longitude

Note: The country / province code fields (e.g., ios2, ios3, etc.) and population were merged with the raw data

Refresh the data

The data in the package is up-to-date with the last date the package was updated on CRAN. To get the most recent data available on the source (data get updated daily), you can use one of the following methods:

Use the update_dataset which will re-install the package from the package repository. Please see the vignette for more details
Pull the data directly from the CSV file on the repository:

library(readr)

vaccine_df <- read_csv(file = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv",
                       col_types = cols(
                         date = col_date(format = ""),
                         country_region = col_character(),
                         continent_name = col_character(),
                         continent_code = col_character(),
                         combined_key = col_character(),
                         doses_admin = col_number(),
                         people_at_least_one_dose = col_number(),
                         population = col_number(),
                         uid = col_number(),
                         iso2 = col_character(),
                         iso3 = col_character(),
                         code3 = col_number(),
                         fips = col_logical(),
                         lat = col_number(),
                         long = col_number()
                       )) 

# Fixing the continent code field - changing NA to "NA" for North America
vaccine_df$continent_code <- ifelse(is.na(vaccine_df$continent_code) & vaccine_df$continent_name == "North America", "NA", vaccine_df$continent_code)

head(vaccine_df)

The plan, in the next version, is to add a similar function to the refresh_coronavirus_jhu function to pull the vaccine dataset.

Analysing the vaccine dataset

The dataset includes two cumulative counts of vaccine doses received:

people_at_least_one_dose - counts both single and multiple doses of vaccine as one dose
doses_admin - counts each dose received separately

So, for example, if a person received a multi doses vaccine such as the Pfizer or Moderna vaccine, here is the count by each method:

If the person completed the two doses, the people_at_least_one_dose will count it as one dose, and the doses_admin will count it as two
Similarly, if the person received only the first out of two doses and skipped the second, it will be counted in both variables as one

We can assume that the last case is rare; therefore, the people_at_least_one_dose could be a good proxy for fully vaccinated people. Let’s see the differences between the two by using the US vaccine data:

covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, doses_admin, total = people_at_least_one_dose) |>
  arrange(date) |>
  plot_ly() |>
  add_lines(x = ~ date,
            y = ~ doses_admin,
            name = "Total Doses") |>
  add_lines(x = ~ date,
            y = ~ total,
            name = "Recevied Vaccine with One/Two Doses") |>
  layout(title = "US Cumulative Number of Doses by Type",
         legend = list(x = 0.05, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = ""))

Here are some thoughts about how to leverage this dataset:

While it is not entirely accurate, as mentioned above, the people_at_least_one_dose could provide a good proxy for the total number of fully vaccinated people
We can leverage the doses_admin column to calculate some stats about the total number of doses provided in a given period or overall.
Combining the vaccine and new cases datasets could provide insights into vaccine adoption and effectiveness.

Country level summary

Before going back and diving into the US data, let's summarize the data by country. Since the data is cumulative, let's take a snapshot of the most recent date in the data and calculate the ratio between the total number of doses and the size of the population:

max(covid19_vaccine$date)

df_summary <- covid19_vaccine |>
  filter(date == max(date)) |>
  select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
  mutate(doses_pop_ratio = doses_admin / population,
         total_pop_ratio = total / population) |>
  arrange(- total)

head(df_summary, 10)

It would be interesting to plot the ratio between the number of overall doses received and the size of the population. Let’s filter missing values and plot the ratio:

df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) |>
  plot_ly(x = ~ population,
          y = ~ total,
          text = ~ paste("Country: ", country_region, "<br>",
                         "Population: ", population, "<br>",
                         "Total Doses: ", total, "<br>",
                         "Ratio: ", round(total_pop_ratio, 2), 
                         sep = ""),
          type = "scatter",
          mode = "markers") |>
  layout(title = "Total Doses vs. Population",
         margin = list(l = 50, r = 50, b = 60, t = 70),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = "Population Size"))

As can see, China and India are skewing the plot. We can re-scale the plot with log transformation of the number of doses and population size using Plotly’s built-in log transformation. Let’s also color by continent and add diagonal lines to represent the 1 to 1 relationship between the median and first quartile:

# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6

# Filter the data
d <- df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) 


# Replot it
plot_ly() |>
  add_markers(x = d$population,
              y = d$total,
              text = ~ paste("Country: ", d$country_region, "<br>",
                             "Population: ", d$population, "<br>",
                             "Total Doses: ", d$total, "<br>",
                             "Ratio: ", round(d$total_pop_ratio, 2), 
                             sep = ""),
              color = d$continent_name,
              type = "scatter",
              mode = "markers") |>
  add_lines(x = c(line_start, line_end),
            y = c(line_start, line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_lines(x = c(line_start, line_end),
            y = c(0.5 * line_start, 0.5 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>

  add_lines(x = c(line_start, line_end),
            y = c(0.25 * line_start, 0.25 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_annotations(text = "1:1",
                  x = log10(line_end * 1.25),
                  y = log10(line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:2",
                  x = log10(line_end * 1.25),
                  y = log10(0.5 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:4",
                  x = log10(line_end * 1.25),
                  y = log10(0.25 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
                  showarrow = FALSE,
                  xref = "paper",
                  yref = "paper",
                  x = -0.05, y = - 0.33) |>
  layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
         margin = list(l = 50, r = 50, b = 90, t = 70),
         yaxis = list(title = "Number of Doses",
                      type = "log"),
         xaxis = list(title = "Population Size",
                      type = "log"),
         legend = list(x = 0.75, y = 0.05))

Another way to represent the data is by using a box plot to plot the total doses and population ration distribution by continent:

plot_ly(d,
        y = ~ total_pop_ratio,
        color = ~ continent_name,
        type = "box", boxpoints = "all", jitter = 0.3,
        pointpos = -1.8
        ) |>
  layout(title = "Distribution of Total Doses Admited vs Population Size Ratio",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses/Population Ratio"),
         xaxis = list(title = "Continent")

  )

Plot US cases

Let's now return to the number of doses in the US and plot the daily number of new cases with the daily number of vaccine doses received. We will use the people_at_least_one_dose column as a proxy for the number of people vaccinated. First, let's reformat the data from cumulative to daily by subtracting from the cumulative count its first lag:

us_vaccine <- covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, total_doses = people_at_least_one_dose) |>
  mutate(total_doses_lag1 = lag(total_doses, 1),
         daily_doses = total_doses - total_doses_lag1)|>
  select(date, daily_doses) |>
  arrange(date) 


head(us_vaccine)

Let's now plot the daily doses:

plot_ly(us_vaccine) |>
  add_lines(x = ~ date,
            y = ~ daily_doses)  |>
  layout(title = "US Daily Vaccine Doses Received",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses"),
         xaxis = list(title = ""))

We can now add the Covid19 cases by loading the dataset using the refresh_coronavirus_jhu function, filter it (to US), and merge it with the us_vaccine dataset we created above:

covid19_cases <- refresh_coronavirus_jhu()

head(covid19_cases)

us_cases <- covid19_cases |>
  filter(location == "US",
         data_type == "cases_new") |>
  select(date, cases = value)


head(us_cases)

Let's now merge the two datasets and plot them:

df <- us_vaccine |> 
  left_join(us_cases, by = "date") |>
  select(date, daily_doses, daily_cases = cases) |>
  arrange(date)

head(df)


plot_ly(df) |>
  add_lines(x = ~ date,
            y = ~ daily_cases,
            name = "Daily Cases") |>
  add_lines(x = ~ date,
            y = ~ daily_doses,
            name = "Daily Doses") |>
  layout(title = "US Daily New Cases vs. Doses of Vaccine",
         yaxis = list(title = "New Cases/Doses"),
         xaxis = list(title = ""),
         legend = list(x = 0.75, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90))