knitr::opts_chunk$set( fig.height=5, fig.width=8, message=FALSE, warning=FALSE, collapse = TRUE, comment = "#>" )
This vignette provides an overview to the covid19_vaccine
dataset.
Before getting started, please note the following:
Data sources:
Let's start by loading the packages we will use in this vignette:
library(coronavirus) library(dplyr) library(plotly)
We load the dataset from the coronavirus package, and use the dplyr and plotly packages to manipulate and plot the data. Let's load the data:
data("covid19_vaccine") head(covid19_vaccine)
The covid19_vaccine
dataset provides time-series data on the vaccination progress by country or province (if applicable). Likewise, the coronavirus
dataset, the COVID19 vaccine raw data is collected by Johns Hopkins University Center for Systems Science and Engineering (JHU CCSE). The covid19_vaccine
data includes the following fields:
date
- Data collection date in YYYY-MM-DD formatcountry_region
- Country or region namecontinent_name
- Continent namecontinent_code
- Continent codecombined_key
- Country and province (if applicable)doses_admin
- Cumulative number of doses administered. When a vaccine requires multiple doses, each one is counted independentlypeople_at_least_one_dose
- Cumulative number of people who received at least one vaccine dose. When the person receives a prescribed second dose, it is not counted twicepopulation
- Country or province populationuid
- Country codeiso2
- Officially assigned country code identifiers with two-letteriso3
- Officially assigned country code identifiers with three-lettercode3
- UN country codefips
- Federal Information Processing Standards code that uniquely identifies counties within the USAlat
- Latitudelong
- LongitudeNote: The country / province code fields (e.g., ios2
, ios3
, etc.) and population were merged with the raw data
The data in the package is up-to-date with the last date the package was updated on CRAN. To get the most recent data available on the source (data get updated daily), you can use one of the following methods:
update_dataset
which will re-install the package from the package repository. Please see the vignette for more details library(readr) vaccine_df <- read_csv(file = "https://raw.githubusercontent.com/RamiKrispin/coronavirus/main/csv/covid19_vaccine.csv", col_types = cols( date = col_date(format = ""), country_region = col_character(), continent_name = col_character(), continent_code = col_character(), combined_key = col_character(), doses_admin = col_number(), people_at_least_one_dose = col_number(), population = col_number(), uid = col_number(), iso2 = col_character(), iso3 = col_character(), code3 = col_number(), fips = col_logical(), lat = col_number(), long = col_number() )) # Fixing the continent code field - changing NA to "NA" for North America vaccine_df$continent_code <- ifelse(is.na(vaccine_df$continent_code) & vaccine_df$continent_name == "North America", "NA", vaccine_df$continent_code) head(vaccine_df)
The plan, in the next version, is to add a similar function to the refresh_coronavirus_jhu
function to pull the vaccine dataset.
The dataset includes two cumulative counts of vaccine doses received:
people_at_least_one_dose
- counts both single and multiple doses of vaccine as one dosedoses_admin
- counts each dose received separatelySo, for example, if a person received a multi doses vaccine such as the Pfizer or Moderna vaccine, here is the count by each method:
people_at_least_one_dose
will count it as one dose, and the doses_admin
will count it as twoWe can assume that the last case is rare; therefore, the people_at_least_one_dose
could be a good proxy for fully vaccinated people. Let’s see the differences between the two by using the US vaccine data:
covid19_vaccine |> filter(country_region == "US") |> select(date, doses_admin, total = people_at_least_one_dose) |> arrange(date) |> plot_ly() |> add_lines(x = ~ date, y = ~ doses_admin, name = "Total Doses") |> add_lines(x = ~ date, y = ~ total, name = "Recevied Vaccine with One/Two Doses") |> layout(title = "US Cumulative Number of Doses by Type", legend = list(x = 0.05, y = 0.95), margin = list(l = 50, r = 50, b = 50, t = 90), yaxis = list(title = "Number of Doses"), xaxis = list(title = ""))
Here are some thoughts about how to leverage this dataset:
people_at_least_one_dose
could provide a good proxy for the total number of fully vaccinated people doses_admin
column to calculate some stats about the total number of doses provided in a given period or overall. Before going back and diving into the US data, let's summarize the data by country. Since the data is cumulative, let's take a snapshot of the most recent date in the data and calculate the ratio between the total number of doses and the size of the population:
max(covid19_vaccine$date) df_summary <- covid19_vaccine |> filter(date == max(date)) |> select(date, country_region, doses_admin, total = people_at_least_one_dose, population, continent_name) |> mutate(doses_pop_ratio = doses_admin / population, total_pop_ratio = total / population) |> arrange(- total) head(df_summary, 10)
It would be interesting to plot the ratio between the number of overall doses received and the size of the population. Let’s filter missing values and plot the ratio:
df_summary |> filter(country_region != "World", !is.na(population), !is.na(total)) |> plot_ly(x = ~ population, y = ~ total, text = ~ paste("Country: ", country_region, "<br>", "Population: ", population, "<br>", "Total Doses: ", total, "<br>", "Ratio: ", round(total_pop_ratio, 2), sep = ""), type = "scatter", mode = "markers") |> layout(title = "Total Doses vs. Population", margin = list(l = 50, r = 50, b = 60, t = 70), yaxis = list(title = "Number of Doses"), xaxis = list(title = "Population Size"))
As can see, China and India are skewing the plot. We can re-scale the plot with log transformation of the number of doses and population size using Plotly’s built-in log transformation. Let’s also color by continent and add diagonal lines to represent the 1 to 1 relationship between the median and first quartile:
# Setting the diagonal lines range line_start <- 10000 line_end <- 1500 * 10 ^ 6 # Filter the data d <- df_summary |> filter(country_region != "World", !is.na(population), !is.na(total)) # Replot it plot_ly() |> add_markers(x = d$population, y = d$total, text = ~ paste("Country: ", d$country_region, "<br>", "Population: ", d$population, "<br>", "Total Doses: ", d$total, "<br>", "Ratio: ", round(d$total_pop_ratio, 2), sep = ""), color = d$continent_name, type = "scatter", mode = "markers") |> add_lines(x = c(line_start, line_end), y = c(line_start, line_end), showlegend = FALSE, line = list(color = "gray", width = 0.5)) |> add_lines(x = c(line_start, line_end), y = c(0.5 * line_start, 0.5 * line_end), showlegend = FALSE, line = list(color = "gray", width = 0.5)) |> add_lines(x = c(line_start, line_end), y = c(0.25 * line_start, 0.25 * line_end), showlegend = FALSE, line = list(color = "gray", width = 0.5)) |> add_annotations(text = "1:1", x = log10(line_end * 1.25), y = log10(line_end * 1.25), showarrow = FALSE, textangle = -25, font = list(size = 8), xref = "x", yref = "y") |> add_annotations(text = "1:2", x = log10(line_end * 1.25), y = log10(0.5 * line_end * 1.25), showarrow = FALSE, textangle = -25, font = list(size = 8), xref = "x", yref = "y") |> add_annotations(text = "1:4", x = log10(line_end * 1.25), y = log10(0.25 * line_end * 1.25), showarrow = FALSE, textangle = -25, font = list(size = 8), xref = "x", yref = "y") |> add_annotations(text = "Source: Johns Hopkins University - Centers for Civic Impact", showarrow = FALSE, xref = "paper", yref = "paper", x = -0.05, y = - 0.33) |> layout(title = "Covid19 Vaccine - Total Doses vs. Population Ratio (Log Scale)", margin = list(l = 50, r = 50, b = 90, t = 70), yaxis = list(title = "Number of Doses", type = "log"), xaxis = list(title = "Population Size", type = "log"), legend = list(x = 0.75, y = 0.05))
Another way to represent the data is by using a box plot to plot the total doses and population ration distribution by continent:
plot_ly(d, y = ~ total_pop_ratio, color = ~ continent_name, type = "box", boxpoints = "all", jitter = 0.3, pointpos = -1.8 ) |> layout(title = "Distribution of Total Doses Admited vs Population Size Ratio", margin = list(l = 50, r = 50, b = 60, t = 60), yaxis = list(title = "Total Doses/Population Ratio"), xaxis = list(title = "Continent") )
Let's now return to the number of doses in the US and plot the daily number of new cases with the daily number of vaccine doses received. We will use the people_at_least_one_dose
column as a proxy for the number of people vaccinated. First, let's reformat the data from cumulative to daily by subtracting from the cumulative count its first lag:
us_vaccine <- covid19_vaccine |> filter(country_region == "US") |> select(date, total_doses = people_at_least_one_dose) |> mutate(total_doses_lag1 = lag(total_doses, 1), daily_doses = total_doses - total_doses_lag1)|> select(date, daily_doses) |> arrange(date) head(us_vaccine)
Let's now plot the daily doses:
plot_ly(us_vaccine) |> add_lines(x = ~ date, y = ~ daily_doses) |> layout(title = "US Daily Vaccine Doses Received", margin = list(l = 50, r = 50, b = 60, t = 60), yaxis = list(title = "Total Doses"), xaxis = list(title = ""))
We can now add the Covid19 cases by loading the dataset using the refresh_coronavirus_jhu
function, filter it (to US), and merge it with the us_vaccine
dataset we created above:
covid19_cases <- refresh_coronavirus_jhu() head(covid19_cases) us_cases <- covid19_cases |> filter(location == "US", data_type == "cases_new") |> select(date, cases = value) head(us_cases)
Let's now merge the two datasets and plot them:
df <- us_vaccine |> left_join(us_cases, by = "date") |> select(date, daily_doses, daily_cases = cases) |> arrange(date) head(df) plot_ly(df) |> add_lines(x = ~ date, y = ~ daily_cases, name = "Daily Cases") |> add_lines(x = ~ date, y = ~ daily_doses, name = "Daily Doses") |> layout(title = "US Daily New Cases vs. Doses of Vaccine", yaxis = list(title = "New Cases/Doses"), xaxis = list(title = ""), legend = list(x = 0.75, y = 0.95), margin = list(l = 50, r = 50, b = 50, t = 90))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.