knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library("revolution")
update_rki_data(method="auto")

Loading and updating the underlying data

The "revolution" package offers the possibility to manualy update the underlying data, so that the plots of time series like the 7-day-incidence or the vaccination data are always up to date. To update official data provided by the Robert Koch Institut, the function update_rki_data() can be called.

As pretty much all functions are based on that data set, it is recommended to keep ontop of recent changes. However, the data provided by the RKI is sometimes broken. Thus, if the package starts to behave strange, try to update the data the next day.

The package also includes the option to manually load an RKI data set. For this we refere to the manual_load_rki_data() function and its documentation.

To update the data for vaccine-related tibbles or plots, the function update_vac_data() can be called.

Furthermore, the RKI offers data for "variants of concerns", as for example the delta-variant. To update this data set, call update_voc_data(). This speciic data set however has not been updated since mid July. We think it's due to delta's dominance, and the resulting redundance of the data.

The official RKI data set

Most of the functions provided by "revolution" obtain their data from the official RKI csv file. Unfortunately, the data set is not without errors and it heavily relies on the reporting capabilities of the local health offices. Thus the number of reported cases fluctuates wildly during a week, as less test are taken and reported on weekends.

The RKI data set lists each known case as a unique row. For about half of them, the data set also lists the infection date. These are assum to be traced cases.

The function calc_traced_cases() returns a time series that contains the total and relative amount of traced cases for each day since 2020. Specific regions and age groups may be specified.

calc_traced_cases(ages="all",region="Baden-Württemberg",from="2020-03-19")

Based on calc_traced_cases(), the two plot functions plot_traced_cases_percentage() and plot_traced_cases_total() visualise the upper tibble.

plot_traced_cases_percentage(ages="all",region="Baden-Württemberg",from="2020-03-19")

The function calc_distribution_report_diff() calculates the lag between infection and reporting for each traced case. This is visualised by plot_distribution_report_diff():

plot_distribution_report_diff()

As seen above, most cases were reported just a few days after the assumed infection date, but a significant share of cases fall outside of that window. In approximately more than 30 000 cases, the reporting date is even before the assumed infection date. We assume that this is due to human error, not planning.

In conclusion, the data needs to be interpreted and analyized carefully.

COVID-19 associated deaths and cases

The RKI data is huge (more than 2 500 000 rows and more than 300 MB that are only made up of letters, digits and commas). Thus, it offers countless possibilities of statistical analysis, some of the most famous ones and/or useful are included in this package. These function often come in pairs: One delivers the time series (a tibble), the other returns the belonging plot to visualise the time series.

One of the most important functions in this context is the get_data_for(). This function allows various extractions of the RKI data. The user can define the desired age group, region and the beginning and the end of the time series that get_data_for() returns:

get_data_for(regions=c("Heidelberg","Mannheim"),ages=c(15,52,84),from="2020-12-02",to="2021-08-05",type="cases")

The belonging plot function is plot_data_for(), thus the plot for the example above looks as follows:

plot_data_for(regions=c("Heidelberg","Mannheim"),ages=c(15,52,84),from="2020-12-02",to="2021-08-05",type="cases")

In this example, the data contains the absolute number of cases, but get_data_for() and plot_data_for() also offer the possibility to get the 7-day-incidence or the number of COVID-19 associated deaths.

Similar functions are get_sti_series_for() and get_time_series_for(). get_sti_series_for() returns a time series of the 7-day-incidence or a "7-day-death-incidence". get_time_series_for() returns a time series for the absolute number of cases and COVID-19 associated deaths. In contrast to get_data_for(), it delivers a single tibble.

The package also offers a visualization of the age distribution during the pandemic: plot_for_agegroups() can be called with type="cases" or with type="sti".

plot_for_agegroups(type="cases")

During the pandemic, several virus variants were found. They may differ in infectiousness or mortality. Those, which seem to be especially dangerous are the so called "Variants of concern". To show their influence, "revolution" has the functions variant_case_time_series(), variant_case_r_value() and variant_sti_time_series(). These functions all return tibbles that contain the data that the function names indicate for each variant of concern. The function plot_variants() delivers the belonging plots.

Unfortunately, the official RKI table only contains weekly data. Hence, plot_variants() tries to interpolate the weekly data points to estimate the daily share of the different variants. plot_variants()is able to plot either the absolute number, the r-value, the 7-day-incidence or the share of each single variant.

plot_variants(interpolation="linear",type="share")

Factors influencing the spread

The way COVID-19 spread during the last few months is diffuse. One of the biggest problems with identifying influential factors is the large number of unknown cases, as well as the never resting mantra of "correlation not impying causality". There is a huge amount of possible factors (seasonality and weather, travel behaviour of the population, mandatory mask wearing, lockdown, etc.), influencing the pandemic. "revolution" takes a closer look at three factors: the income, the geographical proximity between districts and the population density of districts.

To study the correlation between income and COVID-19 incidence, the function income_sti_correlation() comes in handy. It returns a time series of correlation coefficients between the average income of districts and the 7-day-incidence. At first glance, the result seem promising and significant. But after a more precise study, it turned out that the correlation between income and 7-day-incidence mostly depends on the always present East-West divide in Germany.

To check the other assumption that geographical proximity plays an important role in the spread of COVID-19 (caused by super spreader events and such), the function calc_sti_correlation_of_lks() returns a tibble that contains the correlation coefficients between the 7-days-incidence of all district combinations. Thus a call without specification is rather expensive. The user may however specify a subset as follows:

calc_sti_correlation_of_lks(c("Heidelberg","Mannheim","Hamburg"))

All in all, all of the districts show a high correlation due to the complete saturation in Germany. Furthermore, the correlation needs to be handled carefully, as a hotspot in Hamburg and Munich during the same time period would lead to a high correlation, even though the origin of the outbreaks could be totally different.

The third important function in this context is get_pop_density_with_sti() (and for plotting with a linear model: plot_pop_density_with_linear_model()). The former returns a tibble of the 7-days-incidence and the population density of a specified region, the latter delivers a linear model in order to identify a possible correlation between population density and the 7-days-incidence.

plot_pop_density_with_linear_model(c("Heidelberg","Mannheim","Nordfriesland","Schleswig-Flensburg","Hamburg","München"))

This data evaluation delivers similar results as the other analsyis functions: a clear correlation cannot be identified. Of course, one can find highly populated districts that also show high COVID-19, but the correlation is highly sensitive to the regions chosen. All in all, there are simply too many factors influencing the pandemic, that are not covered by these three functions, making "revolution" not sufficient for a deep study of different factors.

COVID-19 vaccine data

In December of 2020, the deciding factor vaccination was introduced to te fight. "revolution" offers the functionality to show the current vaccination progress.

"revolution" has the function get_vaccination_data() (and for plotting: plot_vaccination_data()). It is possible to specify the desired age group, the districts and states, the time period, the vaccination status (such that data is filtered by first or second vaccination) to sum up the vaccination numbers to get cumulated figures.

get_vaccination_data(ages="18-59",regions=c("Niedersachsen","Sachsen"),from="2020-12-26",to="2021-08-07",vac_num=1,cumulate=F)

The belonging plot looks like this:

plot_vaccination_data(ages="18-59",regions=c("Niedersachsen","Sachsen"),from="2020-12-26",to="2021-08-07",vac_num=1,cumulate=F)

Some of the plotting functions have the possibility to smooth the time series. Always check the documentation to see whether smoothing is possible or not.

plot_vaccination_data(ages="18-59",regions=c("Niedersachsen","Sachsen"),from="2020-12-26",to="2021-08-07",vac_num=1,cumulate=F,smoothing=3)

Deaths and excess mortality

In total, there are more than 90 000 COVID-19 associated deaths in Germany (in August 2021). This number is far higher than the documented influenza associated deaths. To study the effect of COVID-19 on the general mortality in Germany, "revolution" provides several functions.

Firstly, the function calc_covid_mortality() is useful in order to clearly identify the most vulnerable age groups. This function calculates the mortality of COVID-19, and thus returns the quotient of deaths and infections per age group for each day.

calc_covid_mortality(ages=80, regions="Baden-Württemberg",from="2020-03-15")

The belonging plot looks like this:

plot_covid_mortality(ages=80, regions="Baden-Württemberg",from="2020-03-15",smoothing=2)

The plot shows that in the initial stages, the mortal of COVID-19 was higher than it is now. This aspect may be explained by the untracked cases, and by improvements in treatment. The effects of the vaccination are also visible. The mortality has been sinking since the second quarter of 2021.

The function get_abs_deaths() returns the absolute number of daily deaths in Germany. The user decides, which years should be returned, the earliest data is from 2000, the latest from 2020.

get_abs_deaths(years=c(2000,2019,2020))

As always, "revolution" provides the belonging function to plot this tibble:

plot_abs_deaths(years=c(2003,2019,2020),smoothing=0)

In contrast to previous years, an exceedingly high number of absolute deaths at the end of the year is significantly recognizable in 2020.

The data set on which the upper function relies on does not contain any information about the number of deaths in different age groups, but there is data on the weekly number of deaths by ages. Unfortunately, the earliest available data is form 2014. The function get_weekly_deaths() prepares that data set and returns a tibble filtered by the desired years and age groups. In addition, it has the feature to divide the weekly number of deaths by the population in order to calculate a "weekly mortality rate". To plot this tibble, call plot_weekly_deaths():

plot_weekly_deaths(c(2018,2019,2020),age="A60-A79")

As expected, the weekly numbers also show a strong rise of deaths at the end of 2020, but for age group "A60-A79", the number of weekly deaths did not increase as much as during the flu epidemic of 2018.

In order to give context to the mortality in 2020, "revolution" provides the function get_total_mortality(). For each year since 2014, it calculates the mortality per age group over the whole year.

plot_total_mortality(age="A80+")

The mortality in 2020 wasn't significantly higher, even though the plot shows the most vulnerable age group "A80+". But it is important to keep in mind, that the total mortality is calculated with accumulated deaths over the whole year. Thus, seasonal effects (COVID-19 may seasonal) may be compensated by a low number of deaths during a different time of year. In 2020 for example, the upper plots show quite a weak flu pandemic at the beginning of the year.

A pretty popular way to measure excess mortality is the comparison between one specific year (for COVID-19, one may chose 2020) and the average deaths per day of the previous years (this is how the "Statistisches Bundesamt" measures excess mortality). To visualize this comparison, the function plot_excess_mortality may be used.

plot_excess_mortality(2020,c(2016,2017,2018,2019),smoothing=1)

Looking at this plot, it becomes rather clear, that winter 2020/2021 showed a significant excess in mortality. At this point it is important to note, that the average must not be calculated with distant years, as demographic changes would distort the comparison:

plot_excess_mortality(2020,c(2000,2001,2002,2003),smoothing=1)


slausmeister/revolution documentation built on Dec. 23, 2021, 3:24 a.m.