In dgolicher/usdata: Just holds some data

knitr::opts_chunk$set(echo = TRUE,message=FALSE,warning=FALSE)

library(tidyverse)
library(RColorBrewer)
library(mapview)
library(sf)
library(COVID19)
library(spdep)
library(tmap)
library(usdata)
library(lubridate)
theme_set(theme_bw())

This is an unfinished draft analysis of a very sensitive subject. Do not cite or use any results until they have been reviewed and refined.

Introduction

One of the commonest criticisms of media reporting on the pandemic has been the use of a cumulative death count as a measure of progress of the disease. This has led to the public perception of ever increasing mortality. The number has been used without a reference denominator. At the very least the number of deaths that would be expected to occur naturally over any given period should have been included.

In the highly charged political atmosphere of the United States the raw death toll has been particularly unhelpful as a metric. There has been a tendency to use the raw number as a yardstick to measure the performance of state governors. One difficulty with this that has been widely overlooked is that the baseline mortality differs greatly between individual states.

Baseline mortality

Crude state wide mortality rates: Years 2011 to 2019

In order to establish a suitable baseline for comparison with the Covid-19 associated mortality recorded for 2020 I used data taken from the US census. The complete census only takes place once every decade, but a comprehensive series of household surveys are combined with official records and modelling to fill in the gaps [@phe]. Figure 1 shows the results of aggregating the death count by state and year and dividing the total by the state population for the relevant year. This produces the crude death rate (CDR) expressed as deaths per 10 thousand of the population.

The variation between states is much more notable than the weak trend. Crude death rates lie within a range of 75 to 90 deaths per 10 thousand for 50% of the states. There are some notable outliers which leads to an overall range which varies by around a factor of two. West Virginia shows consistent exceptional high mortality while Utah has notably very low intrinsic mortality.

data(census)
d<-census
d$rate<-d$Deaths/(d$Population/10000)
d %>% mutate(lb=ifelse(State %in% c("West Virginia","Utah", "California", "Florida", "New York", "North Dakota","Texas", "Oklahoma") ,State,"")) %>%

  ggplot(aes(x=as.factor(Year), y=rate, label=lb)) + geom_boxplot(alpha=0.2,fill="lightgrey",col="lightgrey") + geom_text(size=2) -> g1

g1 + labs(title = "", 
                     caption = "Source: USA census population estimate \n see https://www.census.gov/programs-surveys/popest.html for details",
                     x = "Year",
                     y="Crude death rate (deaths per 10K)") -> g1
g1

Crude state wide mortality rates: Years 2019 to 2020

In order to triangulate against the data obtained from the national census estimates data was aggregated from the cdc weekly counts of deaths for 2019 and 2020 [@cdc_weekly]. There are some minor differences between the estimates for all cause deaths in 2019 between the two data sets, but the general pattern and numbers are consistent.

data("weeekly_counts")
cdc<-weekly_counts
cdc$Year<-cdc$MMWR_Year
cdc$State<-cdc$Jurisdiction_of_Occurrence

cdc %>% filter(name=="All_Cause") %>% group_by(State,pop,Year) %>% summarise(deaths=sum(value)) %>% mutate(rate=deaths/(pop/10000)) -> all_cause


all_cause %>% mutate(lb=ifelse(State %in% c("West Virginia","Utah", "California", "Florida", "New Jersey", "North Dakota","Texas") ,State,"")) %>%

  ggplot(aes(x=as.factor(Year), y=rate, label=lb)) + geom_boxplot(alpha=0.2,fill="lightgrey",col="lightgrey") + geom_text(size=2) -> g2

g2 + labs(title = "", 
                     caption = "Source: Accessed January 20.\n https://catalog.data.gov/dataset/weekly-counts-of-deaths-by-state-and-select-causes-2019-2020.",
                     x = "Year",
                     y="Crude death rate (deaths per 10K)")

d %>% group_by(State) %>% summarise(crude_death_rate=round(mean(rate),1)) -> cdr
cdr_totals<-cdr

library(RColorBrewer)
colors <- brewer.pal(7, "RdYlGn")[7:1]
cdr_geo<-merge(us,cdr)
# mapview(cdr_geo, zcol="crude_death_rate", col.regions = colors, layer.name = "Deaths per 10K",
#              at = round(quantile(cdr$crude_death_rate,c(0,0.2,0.4,0.6,0.8,1)),1)) 
# qtm(cdr_geo, fill="crude_death_rate",fill.pallete = colors, breaks=round(quantile(cdr$crude_death_rate,c(0,0.2,0.4,0.6,0.8,1)),1), legend.outside=TRUE)

map1 <- tm_shape(cdr_geo)+ tm_style_white() +
  tm_fill("crude_death_rate", 
          breaks = round(quantile(cdr$crude_death_rate,c(0,0.1,0.2,0.4,0.6,0.8,0.9,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors )+ tm_text("State", size=0.3) +
  tm_borders() + tm_legend(outside=TRUE) 
#map1

The spatial pattern of crude death rate shown in figure 2 shows a clear pattern. States on both the North East coast and the West tend to have low CDRs, while the Appalachian and Southern States have high values.

map1

Adjustment for age

A fully comprehensive adjustment of the CDR would require the calculation of age specific death rates for all demographic bands for each state combined with data on the number of deaths for each of the age bands. A less detailed approach that results in an adjustment that follows the same overall pattern can be achieved by regressing the crude death rate on the proportion of the population over 70 and then extracting the residual variability. The analysis is shown in figure 3. States with CDRs that fall above the fitted regression line have more deaths than would be predicted based on the proportion of elderly residents. States that fall below the line have fewer deaths than expected.

The regression itself is a good predictor of overall crude death rates with an $R^2$ value of 0.43. The position of Florida, falling well below the trend line, suggests that the overall health of the Florida population may be better than would be expected given the age profile for the state as a whole.

cdr<-merge(cdr,acs)
cdr %>% ggplot(aes(x=percent_over_70, y=crude_death_rate, label=State)) + geom_smooth(method="lm",alpha=0.1,col="lightgrey") + geom_text(size=2) -> g2

g2<- g2 + labs(title = "", 
                     caption = "Source: USA census population estimate \n see https://www.census.gov/programs-surveys/popest.html for details",
                     x = "Percent of state population over 70",
                     y="Crude death rate (deaths per 10K)") 
g2

mod<-lm(data=cdr,crude_death_rate~percent_over_70)
cdr$adjusted_cdr<-residuals(mod)

cdr_geo<-merge(us,cdr)
# mapview(cdr_geo, zcol="adjusted_cdr", col.regions = colors, layer.name = "adjusted_cdr per 10K",
#              at = round(quantile(cdr$adjusted_cdr,c(0,0.2,0.4,0.6,0.8,1)),1))

map2 <- tm_shape(cdr_geo)+ tm_style_white() +
  tm_fill("adjusted_cdr", 
          breaks = round(quantile(cdr$adjusted_cdr,c(0,0.1,0.2,0.4,0.6,0.8,0.9,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors )+
  tm_borders() + tm_legend(outside=TRUE) 
#map2

colors2 <- brewer.pal(7, "Reds")
map2a <- tm_shape(cdr_geo)+ tm_style_white() +
  tm_fill("percent_over_70", 
          breaks = round(quantile(cdr$percent_over_70,c(0,0.1,0.2,0.4,0.6,0.8,0.9,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors2 ) + 
  tm_borders() + tm_legend(outside=TRUE) 
#map2a

Adjustment for household income

States with the highest crude death rates, such as West Virginia and Oklahoma, also have the lowest household incomes. Poverty is generally assumed to be linked to poor health outcomes. The median household income is a good proxy indicator for a wide range of socio-economic factors. Figure 4 shows the results of regressing the age adjusted crude death rate against median household income. The regression is highly significant (p<0.0001) with an $R^2$ of 0.46.

cdr %>% filter(State !="Puerto Rico")-> cdr

cdr%>% ggplot(aes(x=household_income, y=adjusted_cdr, label=State)) + geom_point(size=0) + geom_smooth(method="lm",alpha=0.1,col="lightgrey") + geom_text(size=2) -> g3


g3<- g3 + labs(title = "", 
                     caption = "Source: USA census population estimate \n see https://www.census.gov/programs-surveys/popest.html for details",
                     x = "Median household income",
                     y="CDR adjusted for population over 70") 
g3

mod<-lm(data=cdr,adjusted_cdr~household_income)
cdr$Re_adjusted_cdr<-residuals(mod)

Explanatory power of demographics and income combined.

The analysis demonstrates that both demographic and socio-economic factors are strongly associated with crude death rates. An additive regression model can be fitted in order to provide a measure of the explanatory power of the two factors combined. The model explained 70% of the variability. After holding for differences in the state-wide demography a difference of ten thousand dollars in the mean state-wide household income is associated with a change of around 5.8 in the state-wide CDR.

mod<-lm(data=cdr,crude_death_rate~household_income+percent_over_70)
summary(mod)

cdr_geo<-merge(us,cdr)
# mapview(cdr_geo, zcol="re_adjusted_cdr", col.regions = colors, layer.name = "re_adjusted_cdr",
#              at = round(quantile(cdr$re_adjusted_cdr,c(0,0.2,0.4,0.6,0.8,1)),1)) 

map3 <- tm_shape(cdr_geo)+ tm_style_white() +
  tm_fill("Re_adjusted_cdr", 
          breaks = round(quantile(cdr$Re_adjusted_cdr,c(0,0.1,0.2,0.4,0.6,0.8,0.9,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors ) + 
  tm_borders() + tm_legend(outside=TRUE) 
#map3

colors2 <- brewer.pal(7, "PRGn")
map3a <- tm_shape(cdr_geo)+ tm_style_white() +
  tm_fill("household_income", 
          breaks = round(quantile(cdr$household_income,c(0,0.1,0.2,0.4,0.6,0.8,0.9,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors2 ) + 
  tm_borders() + tm_legend(outside=TRUE) 
#map3a

Mapping the results

library(grid)
grid.newpage()
page.layout <- grid.layout(nrow = 2, ncol = 1)
pushViewport(viewport(layout = page.layout))
print(map2a, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(map3a, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))

library(grid)
grid.newpage()
page.layout <- grid.layout(nrow = 2, ncol = 1)
pushViewport(viewport(layout = page.layout))
print(map2, vp=viewport(layout.pos.row = 1, layout.pos.col = 1))
print(map3, vp=viewport(layout.pos.row = 2, layout.pos.col = 1))

Summary

Baseline crude mortality rates vary by a factor of two between US states.
A large fraction of this variability can be attributed directly to variability in the fraction of the population over 70 years of age.
Independently of the effect of demography, socio-economic differences between states as captured by media household incomes statistically explains a large proportion of the residual variability

Covid-19 mortality

Covid deaths

At the time of writing (r Sys.Date()), nine months have passed since the initial phase of the pandemic. The numbers of deaths attributed to Sars-Cov-2 infections in the USA continues to rise. Any analysis will be provisional and the full impact of the pandemic remains to be documented. The baseline CDR analysis used the sum total of deaths over a twelve month period. Therefore a comparable analysis of Covid-19 attributed deaths can use the current cumulative deaths to date as input.

data("States_COVID19")
d<-States_COVID19
d$population<-aqm::clean(d$population)
d$month<-month(d$date)
d$year<-year(d$date) 
d$covid_month<-d$month+(12*(d$year-2020))

d %>% group_by(State, population) %>% summarise(covid_deaths=max(deaths,na.rm=TRUE)) %>% mutate(covid_cdr=covid_deaths/(population/10000))-> cvd

d %>% group_by(State, population, covid_month) %>% summarise(covid_deaths=max(deaths,na.rm=TRUE)) %>% mutate(covid_cdr=covid_deaths/(population/10000))-> cvd2

cvd2 %>% filter(covid_month> 6) %>% mutate(lb=ifelse(State %in% c("Arizona","Utah", "California", "Florida", "New York", "North Dakota","Texas", "South Dakota", "New Jersey") ,State,"")) %>% 
ggplot(aes(as.factor(covid_month),y=covid_cdr, label=lb)) + geom_boxplot(alpha=0.2,fill="lightgrey",col="lightgrey") + geom_text(size=2) ->cg1
cg1<- cg1 + labs(title = "", 
                     caption = "Source: COVID-19 Case Surveillance Public Use Data",
                     x = "Month",
                     y="Covid-19 deaths per 10k population") 
cg1

Mapping Covid deaths

The cumulative numbers of Covid-19 deaths reported for each state divided by the state population (in 2019) is shown in figure 9

cvd_geo<-merge(us,cvd)
covid_map1 <- tm_shape(cvd_geo)+ tm_style_white() +
  tm_fill("covid_cdr", 
          breaks = round(quantile(cvd$covid_cdr,c(0,0.2,0.4,0.6,0.8,1)),1),
          style = "fixed",
          textNA = "No data",
          colorNA = "white", 
          palette = colors )+ tm_text("State", size=0.3) +
  tm_borders() + tm_legend(outside=TRUE) 
covid_map1

Correlates of Covid-19 mortality

In contrast to the baseline all cause population adjusted mortality the cumulative deaths attributed to Covid-19 per state show no statistically significant relationship with any of the covariates associated with baseline mortality.

In addition the cumulative NPI stringency index was calculated for the states over the period since non pharmaceutical interventions were introduced. Taking the mean stringency index as a single measure ignores the timing of interventions, but produces a single integrated measure of both intensity and duration of the measures. The duration of school closures in days was also included in the analysis.

Figure 10 shows the results of individual regressions of the crude covid related death ratio on six covariates. Non of the regressions is statistically significant.

cdr<-merge(cdr,cvd)
cdr$covid_percent_of_baseline<-100*(cdr$covid_cdr/cdr$crude_death_rate)

cdr<-merge(cdr,measures)
ggplot(cdr,aes(x=crude_death_rate,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg1

ggplot(cdr,aes(x=percent_over_70,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg2

ggplot(cdr,aes(x=percent_over_70,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg3

ggplot(cdr,aes(x=Re_adjusted_cdr,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg4

ggplot(cdr,aes(stringency,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg5

ggplot(cdr,aes(schools,y=covid_cdr,label=State))  + geom_smooth(method="lm",alpha=0.5,col="black") +geom_point(size=1) ->gg6

multiplot(gg1,gg2,gg3,gg4,gg5,gg6, cols=2)

A multiple additive regression model including all six potential explanatory variables did not show any variable to be significantly associated with the covid crude deaths ratio.

model<- lm(data=cdr, covid_cdr~crude_death_rate+percent_over_70+percent_over_70+Re_adjusted_cdr+stringency+schools)
anova(model, test="F")

Spatial autocorrelation

One alternative explanation for the variability in deaths between states is spatial proximity to neighbouring states with similar death rates. The effect of neighbours can be tested by calculating Moran's I based on a queen's move neighbour rule.

cdr_geo<-merge(us,cdr)
s<-as(cdr_geo,"Spatial")
nb <- poly2nb(s, queen=TRUE)
lw <- nb2listw(nb, style="W", zero.policy=TRUE)
moran.test(s$covid_cdr,lw, alternative="greater")

A more rigorous test involves a Monte Carlo simulation.

moran.mc(s$covid_percent_of_baseline, lw, nsim=999, alternative="greater")

Moran's I can be visualised by

cdr$mean_neighbours <- lag.listw(lw, s$covid_percent_of_baseline)

cdr %>% ggplot(aes(x=mean_neighbours, y=covid_percent_of_baseline, label=State)) +geom_smooth(method="lm", alpha=0.1,col="lightgrey") + geom_text(size=2, check_overlap = TRUE, nudge_y=0.5) +xlab("Mean percent baseline mortality in neighbouring states")  + ylab("Percent baseline mortality")

Change in clustering over time

The change the spatial autocorrelation (clustering) of the covid related crude death rate can be depicted as a change in the Moran's I statistic calculated for the cummulative deaths atributed to covid in each month since the epidemic began in March.

Change in value of Moran's I

Change in Moran's I

Multiple regression model with spatial component

An additive multiple regression model that included the spatial component shows some statistical significance (p=0.017) as a result of inclusion of the spatial term. The only term in the model that was weakly, but statistically significantly, associated with the covid related crude death ratio was the mean crude death ratio related to covid in neighbouring states.

model<- lm(data=cdr, covid_cdr~crude_death_rate+percent_over_70+percent_over_70+Re_adjusted_cdr+stringency+schools + mean_neighbours)
anova(model, test="F")
summary(model)

data("States_COVID19")
d<-States_COVID19
d %>% group_by(State) %>% summarise(tests=max(tests, na.rm=TRUE),cases=max(confirmed,na.rm=TRUE) ,deaths=max(deaths, na.rm=TRUE), population = max(population, na.rm=TRUE)) -> d
d <-merge(measures,d)

d$tests_per<-d$tests/(d$population)
d$cases_per<-d$cases/(d$population/10000)
d$deaths_per<-d$deaths/(d$population/10000)

ggplot(d,aes(x=tests_per,y=deaths_per, label=State) ) + geom_text() + geom_smooth(method="lm") + xlab("Tests per capita") + ylab ("Total covid related deaths per 10k of the population")

Summary

Baseline crude mortality rates are strongly correlated with demographic and socio economic factors. In particular median household income is a effective predictor of a States CDR after holding for the percentage of the population over 70.
There are no statistically significant relationships between covid related crude mortality rate and the key demographic and socio economic determinants of all cause crude death rates.
There are no statistically significant relationships between covid related crude death rate and integrated measures of the stringency of no pharmaceutical interventions.
The best predictor of covid related crude death rate are the rates measured in neighbouring states.

[@CDC; @CDC2020; @CDC2020a; @Force, @phe, @cdc_weekly]

References and data sources

dgolicher/usdata documentation built on Feb. 17, 2021, 6:38 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

dgolicher/usdata
Just holds some data

In dgolicher/usdata: Just holds some data

Introduction

Baseline mortality

Crude state wide mortality rates: Years 2011 to 2019

Crude state wide mortality rates: Years 2019 to 2020

Adjustment for age

Adjustment for household income

Explanatory power of demographics and income combined.

Mapping the results

Summary

Covid-19 mortality

Covid deaths

Mapping Covid deaths

Correlates of Covid-19 mortality

Spatial autocorrelation

Change in clustering over time

Multiple regression model with spatial component

Summary

References and data sources

R Package Documentation

Browse R Packages

We want your feedback!

dgolicher/usdata Just holds some data

In dgolicher/usdata: Just holds some data

Introduction

Baseline mortality

Crude state wide mortality rates: Years 2011 to 2019

Crude state wide mortality rates: Years 2019 to 2020

Adjustment for age

Adjustment for household income

Explanatory power of demographics and income combined.

Mapping the results

Summary

Covid-19 mortality

Covid deaths

Mapping Covid deaths

Correlates of Covid-19 mortality

Spatial autocorrelation

Change in clustering over time

Multiple regression model with spatial component

Summary

References and data sources

R Package Documentation

Browse R Packages

We want your feedback!

dgolicher/usdata
Just holds some data