knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
options(tibble.print_min = 4L, tibble.print_max = 4L)
library(tidyverse)
library(lubridate)
library(scales)
library(flumodelr)

The primary method paper was a 1963 paper published by Robert Serfling working for the Centers for Disease Control and Prevention.

Serfling RE. Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Rep. 1963 Jun; 78(6): 494 - 506.

The paper outlines a strategy for estimating what proportion of deaths are due to influenza. The underlying issue is that many deaths reported due to influenza and pneumonia may be due to other causes, while many deaths attritable other causes may be due to influenza. The primary goal was to develop a "standard curve of expected seasonal mortality". The concept was that an individual could use historical data to estimate seasonal trends in influenza. Then for a given place-time an researcher could evaluate how many deaths occurred in excess of this baseline rate. Much of the original paper is of little interest given modern computing methods, but the basic concept persists as a reasonable approach to estimating "flu" deaths.

Influenza

Critically, it should be understood that influenza epidemics are highly seasonal with spikes in the winter months, commonly January - February. This seasonality leads to a cyclical rate of influenza morbidity and mortality.

df_cdc <- flumodelr::fludta %>%
  dplyr::filter(year>=2010) %>%
  mutate(t = row_number(), #set origin to october
         theta = 2*t / 52,
         sin_2 = sinpi(theta),
         cos_2 = cospi(theta)) 

g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=perc_fludeaths)) + 
  geom_line(size=0.8, colour="#CC0000") +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("% of Deaths from P&I") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 1. Pneumonia and Influenza Mortality")

It was noted by early researchers that the rate of reported deaths (red line), may be modeled by a trigonomic function.

Linear and Cyclical Regression Models

The classical ordinary least squares framework is often described in matrix notation as:
$$y_i = X_i\beta + u_{i} \quad \textrm{where} \quad i = 1,..,n$$

Where $y$ is a dependent variable, $X$ is a vector of regressors (i.e. independent variables) with $k$-dimensions, $\beta$ is a vector of the coefficients for $X$, and $u$ is the residual error term. often simply as: $y= X\beta+u$.

Given a simple additive model with one independent variable t: $$Eq \ 1. \ y = \alpha_0 + \beta_1*t_1 + u$$

Let t, be the unit of time in figure 1 (week = 1, week=2, ...). The above model is inadequate, and will poorly fit the data (figure) because the secular trend is non-linear.

g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=perc_fludeaths)) + 
  geom_line(size=0.8, colour="#CC0000") +
  geom_smooth(method='lm') +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("% of Deaths from P&I") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 2. Pneumonia and Influenza Deaths Over Time",
       caption="Ordinary Least Squares")
g1

Early investigators such as Serfling proposed that a Fourier term be added to model the cycle like so: $$Eq \ 2. \ y = \alpha_0 + \beta_1*t + sine(\frac{2 \pi t}{52}) + cos(\frac{2 \pi t}{52}) + u$$ The original paper used 4-week periods so the period denominator is 13. This allows us to use a linear model which accounts for the cyclical nature of the disease. The original paper recommends one Fourier term.

Model 1: Original Serfling Model

The basic linear equation is like so: $$Y = beta_0\alpha + beta_1t + beta_6cos(\frac{2 \pi t}{52}) + beta_7sin(\frac{2 \pi t}{52})$$ Where in our weekly CDC data, t is a single week, with a 52 week cycle. In the original paper it was 4-week periods, with 13 per annual cycle.

Method 1: Endemic baseline

Select a season of low epidemicity:

g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=fludeaths)) + 
  geom_line(size=0.8) +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("No. of Deaths from P&I") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 1. Pneumonia and Influenza Mortality") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

g1

2011 / 2012 is period of lowest influenza activity, so taken as the baseline.

This season was used to estimate parameters for the above equation then fit to a future season.

Estimate model for 2011 / 2012 season

df_low <- df_cdc %>%
  dplyr::filter(yrweek_dt >= ymd('2011-10-01') & yrweek_dt <= ymd('2012-09-30')) %>%
  mutate(t = row_number(), #set origin to october
         theta = 2*t / 52,
         sin_2 = sinpi(theta),
         cos_2 = cospi(theta)) 

lm(fludeaths ~ t + sin_2 + cos_2, #note the 0 term
  data=df_low, na.action = na.exclude) -> mod_1_low 

summary(mod_1_low)

sin_2 is the Fourier term: sin (2$$\pi$$t) / L where L is length of cycle (i.e. 52 weeks)
Fourier series can be iteratively explored for best fit by adjusting the harmonic formula.

Inclusion of biannual or monthly cycles can be done with additional Fourier terms.

E.g. With semi-annual Fourier term:

$$Y = beta_0\alpha + beta_1t + beta_6cos(\frac{2 \pi t}{52}) + beta_7sin(\frac{2 \pi t}{52}) + beta_8cos(\frac{2 \pi t}{26}) + beta_9sin(\frac{2 \pi t}{26})$$

df_low_pred <- suppressWarnings(predict(mod_1_low, se.fit=TRUE, 
          interval="prediction", level=0.90)) 

df_low <- bind_cols(df_low, pred_yhat_0 = df_low_pred$fit[, 'fit'])

g_mod_1 <- ggplot(df_low) + 
  geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("No. of deaths") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 2. Original Serfling equation fit to 2011/12") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

g_mod_1

The fit works well for this particular period.

Apply model estimates to fit 2012 / 2013 year.

df_high <- df_cdc %>%
  dplyr::filter(yrweek_dt >= ymd('2012-10-01') & yrweek_dt <= ymd('2013-09-30')) %>%
  mutate(t = row_number(), #set origin to october
         theta = 2*t / 52,
         sin_2 = sinpi(theta),
         cos_2 = cospi(theta)) 

df_pred <- predict(mod_1_low, newdata=df_high, se.fit=TRUE, 
           interval="prediction", level=0.90)  

pred_yhat_0 <- df_pred$fit[,1] #fitted values
pred_yhat_upr <- df_pred$fit[,3] 

df_mod_1 <- df_high %>%
  add_column(., pred_yhat_0, pred_yhat_upr)  
g_mod_1 <- ggplot(df_mod_1) + 
  geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("No. of deaths") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 3. 2011/12 estimates fitted to 2012/13 season") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

g_mod_1

Gray - observed deaths
Blue dash - fitted curved for 2011 / 2012, applied to 2012 / 2013 Red dotted - upper 90% prediction interval

In the original paper, deaths exceeding the 90% interval would be considered in excess, and if two consecutive weeks occurred than that was a 'epidemic'.

However, it was noted in the paper this was less practical when consecutive years occur without a very low endemicity flu season. An alternative approach was given that is more generalizable.

Method 2: Secular trend differencing

One approach is to estimate the model using the previous 5 years of data. Assuming this will average out epidemics and give a baseline measure of activity.

df_cdc_2  <- df_cdc %>%
  dplyr::filter(fluyr<2014) %>%
  mutate(t = row_number(), #set origin to october
         theta = 2*t / 52,
         sin_2 = sinpi(theta),
         cos_2 = cospi(theta)) 

mod_1_sec_fit <- df_cdc_2 %>%
  lm(fludeaths ~ t + sin_2 + cos_2, #note the 0 term
    data=., na.action = na.exclude) 

df_pred <- predict(mod_1_sec_fit, newdata = df_cdc, se.fit=TRUE, 
           interval="prediction", level=0.90)  

pred_yhat_0 <- df_pred$fit[,1] #fitted values
pred_yhat_upr <- df_pred$fit[,3] 

df_mod_1_sec <- df_cdc %>%
  add_column(., pred_yhat_0, pred_yhat_upr)  
g_mod_1 <- ggplot(df_mod_1_sec[df_mod_1_sec$fluyr<2014, ]) + 
  geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("No. of deaths") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 3. 2010 - 2014 estimates fitted") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

g_mod_1

Model fit from first four seasons, notice the 2012 / 2013 season does not fit the curve well.

g_mod_1 <- ggplot(df_mod_1_sec[df_mod_1_sec$fluyr==2014, ]) + 
  geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') +
  geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("No. of deaths") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 3. 2010 - 2014 estimates fitted to 2014/2015 season") +
  theme(axis.text.x=element_text(angle=90, hjust=1))

g_mod_1

Gray - observed deaths
Blue dash - fitted curved for 2011 / 2012, applied to 2012 / 2013 Red dotted - upper 90% prediction interval

In quantitative terms:

df_rpt <- df_mod_1_sec %>%
  dplyr::filter(fluyr==2014) %>%
  select(fluyr, week, fludeaths, pred_yhat_0, pred_yhat_upr) %>%
  rowwise() %>%
  mutate(excess_deaths = max(fludeaths - pred_yhat_upr, 0)) %>%
  round(., 0) 

head(df_rpt)

Most weeks no excess deaths are reported, until the month of January when an epidemic occurs.

total <- formatC(sum(df_rpt$fludeaths), format = 'd', big.mark = ',')
total_excess <- formatC(sum(df_rpt$excess_deaths), format = 'd', big.mark = ',')

The total reported deaths for the 2014 - 2015 influenza season were r total but only r total_excess are deemed in excess of expected based the 90% upper prediction interval for the cyclical regression.

Example. Fitted line (one Fourier term)

## Add fourier term
fit <- fludta %>%
  dplyr::filter(year>=2010) %>%
  mutate(week2 = row_number(),
         theta = 2*week2/52,
         sin_f1 = sinpi(theta),
         cos_f1 = cospi(theta),
         pred_y = predict( lm(perc_fludeaths ~ 
                                week2 + sin_f1 + cos_f1, data=.)))
g1 <- ggplot(fit, aes(x=yrweek_dt)) + 
  geom_line(aes(y=perc_fludeaths), size=0.8, colour="#CC0000") +
  geom_line(aes(y=pred_y), size=0.8, colour="black") +
  scale_x_date(labels = date_format("%Y"), date_breaks="1 year",
               expand=c(0, .9)) + 
  xlab("Year") + 
  ylab("% of Deaths from P&I") + 
  theme_light(base_size=16) +
  theme(plot.title = element_text(size=14)) +
  labs(title="Figure 3. Pneumonia and Influenza Mortality",
       caption="Serfling Model")
g1

A single Fourier term does a reasonable job approximating the secular trend of disease, except notably during severe influenza epidemics. The black solid line represents a single Fourier term.

During certain seasons, there is a pronounced spike above the predicted (fit) line. For example in the 12-13, 13-14 and 14-15 seasons. These can be subjectively considered severe influenza epidemics. However a goal could be to define objective criteria for epidemics. Such as any time-period 1.64 standard deviations above the fitted line representing a time of severe influenza morbidity and mortality, requiring significant public health intervention.

Note regarding standard errors.

By default R's predict.lm will fit a 95% prediction interval. In Serfling's original paper it states:

"The epidemic threshold...is placed at a distance of 1.64 standard deviations above the trend line, a level which experience has shown to be useful for distinguishing epidemic increase from random variation."

The CDC also describes the threshold as:

"The seasonal baseline of P&I deaths is calculated using a periodic regression model that incorporates a robust regression procedure applied to data from the previous five years. An increase of 1.645 standard deviations > above the seasonal baseline of P&I deaths is considered the 'epidemic threshold,' i.e., the point at which the observed proportion of deaths attributed to pneumonia or influenza was significantly higher than would be expected at that time of the year in the absence of substantial influenza-related mortality."

fluserf follows this method, but the modified serfling function allows for more options for threshold prediction. 1.645 standard deviations is equivalents to a one-sided upper limit 95% confidence interval.

fluserf() function

##The following command completes the above steps 
##fit serfling model
flu_fit <- fluserf(data=df_cdc, outc=perc_fludeaths, time=yrweek_dt)

head(flu_fit)

Plot function fluplot

fluplot(flu_fit, xvar=yrweek_dt, perc_fludeaths, y0, y0_ul,
                    ylab="% Deaths from P&I", title="Serfling Model")

Criticisms of this approach

The historical Serfling model should be considered as important background and an useful educational tool. But probably not applied in an research project.

References



kmcconeghy/flumodelr documentation built on June 7, 2019, 8:47 p.m.