knitr::opts_chunk$set(
  fig.height=5, fig.width=8, message=FALSE, warning=FALSE,
  collapse = TRUE,
  comment = "#>"
)

This vignette provides an overview to the covid19_vaccine dataset.

Data

Before getting started, please note the following:

Data sources:

Load the data

Let's start by loading the packages we will use in this vignette:

library(coronavirus)
library(dplyr)
library(plotly)

We load the dataset from the coronavirus package, and use the dplyr and plotly packages to manipulate and plot the data. Let's load the data:

data("covid19_vaccine")

head(covid19_vaccine)

The covid19_vaccine dataset provides time-series data on the vaccination progress by country or province (if applicable). Likewise, the coronavirus dataset, the COVID19 vaccine raw data is collected by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). The covid19_vaccine data includes the following fields:

Note: The country / province code fields (e.g., ios2, ios3, etc.) and population were merged with the raw data

Refresh the data

The data in the package is up-to-date with the last date the package was updated on CRAN. To get the most recent data available on the source (data get updated daily), you can use one of the following methods:

library(readr)

vaccine_df <- read_csv(file = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv",
                       col_types = cols(
                         date = col_date(format = ""),
                         country_region = col_character(),
                         continent_name = col_character(),
                         continent_code = col_character(),
                         combined_key = col_character(),
                         doses_admin = col_number(),
                         people_at_least_one_dose = col_number(),
                         population = col_number(),
                         uid = col_number(),
                         iso2 = col_character(),
                         iso3 = col_character(),
                         code3 = col_number(),
                         fips = col_logical(),
                         lat = col_number(),
                         long = col_number()
                       )) 

# Fixing the continent code field - changing NA to "NA" for North America
vaccine_df$continent_code <- ifelse(is.na(vaccine_df$continent_code) & vaccine_df$continent_name == "North America", "NA", vaccine_df$continent_code)

head(vaccine_df)

The plan, in the next version, is to add a similar function to the refresh_coronavirus_jhu function to pull the vaccine dataset.

Analysing the vaccine dataset

The dataset includes two cumulative counts of vaccine doses received:

So, for example, if a person received a multi doses vaccine such as the Pfizer or Moderna vaccine, here is the count by each method:

We can assume that the last case is rare; therefore, the people_at_least_one_dose could be a good proxy for fully vaccinated people. Let’s see the differences between the two by using the US vaccine data:

covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, doses_admin, total = people_at_least_one_dose) |>
  arrange(date) |>
  plot_ly() |>
  add_lines(x = ~ date,
            y = ~ doses_admin,
            name = "Total Doses") |>
  add_lines(x = ~ date,
            y = ~ total,
            name = "Recevied Vaccine with One/Two Doses") |>
  layout(title = "US Cumulative Number of Doses by Type",
         legend = list(x = 0.05, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = ""))

Here are some thoughts about how to leverage this dataset:

Country level summary

Before going back and diving into the US data, let's summarize the data by country. Since the data is cumulative, let's take a snapshot of the most recent date in the data and calculate the ratio between the total number of doses and the size of the population:

max(covid19_vaccine$date)

df_summary <- covid19_vaccine |>
  filter(date == max(date)) |>
  select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |>
  mutate(doses_pop_ratio = doses_admin / population,
         total_pop_ratio = total / population) |>
  arrange(- total)

head(df_summary, 10)

It would be interesting to plot the ratio between the number of overall doses received and the size of the population. Let’s filter missing values and plot the ratio:

df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) |>
  plot_ly(x = ~ population,
          y = ~ total,
          text = ~ paste("Country: ", country_region, "<br>",
                         "Population: ", population, "<br>",
                         "Total Doses: ", total, "<br>",
                         "Ratio: ", round(total_pop_ratio, 2), 
                         sep = ""),
          type = "scatter",
          mode = "markers") |>
  layout(title = "Total Doses vs. Population",
         margin = list(l = 50, r = 50, b = 60, t = 70),
         yaxis = list(title = "Number of Doses"),
         xaxis = list(title = "Population Size"))

As can see, China and India are skewing the plot. We can re-scale the plot with log transformation of the number of doses and population size using Plotly’s built-in log transformation. Let’s also color by continent and add diagonal lines to represent the 1 to 1 relationship between the median and first quartile:

# Setting the diagonal lines range
line_start <- 10000
line_end <- 1500 * 10 ^ 6

# Filter the data
d <- df_summary |> 
  filter(country_region != "World", 
         !is.na(population),
         !is.na(total)) 


# Replot it
plot_ly() |>
  add_markers(x = d$population,
              y = d$total,
              text = ~ paste("Country: ", d$country_region, "<br>",
                             "Population: ", d$population, "<br>",
                             "Total Doses: ", d$total, "<br>",
                             "Ratio: ", round(d$total_pop_ratio, 2), 
                             sep = ""),
              color = d$continent_name,
              type = "scatter",
              mode = "markers") |>
  add_lines(x = c(line_start, line_end),
            y = c(line_start, line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_lines(x = c(line_start, line_end),
            y = c(0.5 * line_start, 0.5 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>

  add_lines(x = c(line_start, line_end),
            y = c(0.25 * line_start, 0.25 * line_end),
            showlegend = FALSE,
            line = list(color = "gray", width = 0.5)) |>
  add_annotations(text = "1:1",
                  x = log10(line_end * 1.25),
                  y = log10(line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:2",
                  x = log10(line_end * 1.25),
                  y = log10(0.5 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "1:4",
                  x = log10(line_end * 1.25),
                  y = log10(0.25 * line_end * 1.25),
                  showarrow = FALSE,
                  textangle = -25,
                  font = list(size = 8),
                  xref = "x",
                  yref = "y") |>
  add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact",
                  showarrow = FALSE,
                  xref = "paper",
                  yref = "paper",
                  x = -0.05, y = - 0.33) |>
  layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)",
         margin = list(l = 50, r = 50, b = 90, t = 70),
         yaxis = list(title = "Number of Doses",
                      type = "log"),
         xaxis = list(title = "Population Size",
                      type = "log"),
         legend = list(x = 0.75, y = 0.05))

Another way to represent the data is by using a box plot to plot the total doses and population ration distribution by continent:

plot_ly(d,
        y = ~ total_pop_ratio,
        color = ~ continent_name,
        type = "box", boxpoints = "all", jitter = 0.3,
        pointpos = -1.8
        ) |>
  layout(title = "Distribution of Total Doses Admited vs Population Size Ratio",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses/Population Ratio"),
         xaxis = list(title = "Continent")

  )

Plot US cases

Let's now return to the number of doses in the US and plot the daily number of new cases with the daily number of vaccine doses received. We will use the people_at_least_one_dose column as a proxy for the number of people vaccinated. First, let's reformat the data from cumulative to daily by subtracting from the cumulative count its first lag:

us_vaccine <- covid19_vaccine |> 
  filter(country_region == "US") |>
  select(date, total_doses = people_at_least_one_dose) |>
  mutate(total_doses_lag1 = lag(total_doses, 1),
         daily_doses = total_doses - total_doses_lag1)|>
  select(date, daily_doses) |>
  arrange(date) 


head(us_vaccine)

Let's now plot the daily doses:

plot_ly(us_vaccine) |>
  add_lines(x = ~ date,
            y = ~ daily_doses)  |>
  layout(title = "US Daily Vaccine Doses Received",
         margin = list(l = 50, r = 50, b = 60, t = 60),
         yaxis = list(title = "Total Doses"),
         xaxis = list(title = ""))

We can now add the Covid19 cases by loading the dataset using the refresh_coronavirus_jhu function, filter it (to US), and merge it with the us_vaccine dataset we created above:

covid19_cases <- refresh_coronavirus_jhu()

head(covid19_cases)

us_cases <- covid19_cases |>
  filter(location == "US",
         data_type == "cases_new") |>
  select(date, cases = value)


head(us_cases)

Let's now merge the two datasets and plot them:

df <- us_vaccine |> 
  left_join(us_cases, by = "date") |>
  select(date, daily_doses, daily_cases = cases) |>
  arrange(date)

head(df)


plot_ly(df) |>
  add_lines(x = ~ date,
            y = ~ daily_cases,
            name = "Daily Cases") |>
  add_lines(x = ~ date,
            y = ~ daily_doses,
            name = "Daily Doses") |>
  layout(title = "US Daily New Cases vs. Doses of Vaccine",
         yaxis = list(title = "New Cases/Doses"),
         xaxis = list(title = ""),
         legend = list(x = 0.75, y = 0.95),
         margin = list(l = 50, r = 50, b = 50, t = 90))


RamiKrispin/coronavirus documentation built on March 22, 2023, 7:27 p.m.