knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) options(tibble.print_min = 4L, tibble.print_max = 4L) library(tidyverse) library(lubridate) library(scales) library(flumodelr)
The primary method paper was a 1963 paper published by Robert Serfling working for the Centers for Disease Control and Prevention.
The paper outlines a strategy for estimating what proportion of deaths are due to influenza. The underlying issue is that many deaths reported due to influenza and pneumonia may be due to other causes, while many deaths attritable other causes may be due to influenza. The primary goal was to develop a "standard curve of expected seasonal mortality". The concept was that an individual could use historical data to estimate seasonal trends in influenza. Then for a given place-time an researcher could evaluate how many deaths occurred in excess of this baseline rate. Much of the original paper is of little interest given modern computing methods, but the basic concept persists as a reasonable approach to estimating "flu" deaths.
Critically, it should be understood that influenza epidemics are highly seasonal with spikes in the winter months, commonly January - February. This seasonality leads to a cyclical rate of influenza morbidity and mortality.
df_cdc <- flumodelr::fludta %>% dplyr::filter(year>=2010) %>% mutate(t = row_number(), #set origin to october theta = 2*t / 52, sin_2 = sinpi(theta), cos_2 = cospi(theta)) g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=perc_fludeaths)) + geom_line(size=0.8, colour="#CC0000") + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("% of Deaths from P&I") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 1. Pneumonia and Influenza Mortality")
It was noted by early researchers that the rate of reported deaths (red line), may be modeled by a trigonomic function.
The classical ordinary least squares framework is often described in matrix notation as:
$$y_i = X_i\beta + u_{i} \quad \textrm{where} \quad i = 1,..,n$$
Where $y$ is a dependent variable, $X$ is a vector of regressors (i.e. independent variables) with $k$-dimensions, $\beta$ is a vector of the coefficients for $X$, and $u$ is the residual error term. often simply as: $y= X\beta+u$.
Given a simple additive model with one independent variable t: $$Eq \ 1. \ y = \alpha_0 + \beta_1*t_1 + u$$
Let t, be the unit of time in figure 1 (week = 1, week=2, ...). The above model is inadequate, and will poorly fit the data (figure) because the secular trend is non-linear.
g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=perc_fludeaths)) + geom_line(size=0.8, colour="#CC0000") + geom_smooth(method='lm') + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("% of Deaths from P&I") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 2. Pneumonia and Influenza Deaths Over Time", caption="Ordinary Least Squares") g1
Early investigators such as Serfling proposed that a Fourier term be added to model the cycle like so: $$Eq \ 2. \ y = \alpha_0 + \beta_1*t + sine(\frac{2 \pi t}{52}) + cos(\frac{2 \pi t}{52}) + u$$ The original paper used 4-week periods so the period denominator is 13. This allows us to use a linear model which accounts for the cyclical nature of the disease. The original paper recommends one Fourier term.
The basic linear equation is like so: $$Y = beta_0\alpha + beta_1t + beta_6cos(\frac{2 \pi t}{52}) + beta_7sin(\frac{2 \pi t}{52})$$ Where in our weekly CDC data, t is a single week, with a 52 week cycle. In the original paper it was 4-week periods, with 13 per annual cycle.
Select a season of low epidemicity:
g1 <- ggplot(df_cdc, aes(x=yrweek_dt, y=fludeaths)) + geom_line(size=0.8) + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("No. of Deaths from P&I") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 1. Pneumonia and Influenza Mortality") + theme(axis.text.x=element_text(angle=90, hjust=1)) g1
2011 / 2012 is period of lowest influenza activity, so taken as the baseline.
This season was used to estimate parameters for the above equation then fit to a future season.
df_low <- df_cdc %>% dplyr::filter(yrweek_dt >= ymd('2011-10-01') & yrweek_dt <= ymd('2012-09-30')) %>% mutate(t = row_number(), #set origin to october theta = 2*t / 52, sin_2 = sinpi(theta), cos_2 = cospi(theta)) lm(fludeaths ~ t + sin_2 + cos_2, #note the 0 term data=df_low, na.action = na.exclude) -> mod_1_low summary(mod_1_low)
sin_2
is the Fourier term: sin (2$$\pi$$t) / L where L is length of cycle (i.e. 52 weeks)
Fourier series can be iteratively explored for best fit by adjusting the harmonic
formula.
Inclusion of biannual or monthly cycles can be done with additional Fourier terms.
E.g. With semi-annual Fourier term:
$$Y = beta_0\alpha + beta_1t + beta_6cos(\frac{2 \pi t}{52}) + beta_7sin(\frac{2 \pi t}{52}) + beta_8cos(\frac{2 \pi t}{26}) + beta_9sin(\frac{2 \pi t}{26})$$
df_low_pred <- suppressWarnings(predict(mod_1_low, se.fit=TRUE, interval="prediction", level=0.90)) df_low <- bind_cols(df_low, pred_yhat_0 = df_low_pred$fit[, 'fit']) g_mod_1 <- ggplot(df_low) + geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) + geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("No. of deaths") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 2. Original Serfling equation fit to 2011/12") + theme(axis.text.x=element_text(angle=90, hjust=1)) g_mod_1
The fit works well for this particular period.
df_high <- df_cdc %>% dplyr::filter(yrweek_dt >= ymd('2012-10-01') & yrweek_dt <= ymd('2013-09-30')) %>% mutate(t = row_number(), #set origin to october theta = 2*t / 52, sin_2 = sinpi(theta), cos_2 = cospi(theta)) df_pred <- predict(mod_1_low, newdata=df_high, se.fit=TRUE, interval="prediction", level=0.90) pred_yhat_0 <- df_pred$fit[,1] #fitted values pred_yhat_upr <- df_pred$fit[,3] df_mod_1 <- df_high %>% add_column(., pred_yhat_0, pred_yhat_upr)
g_mod_1 <- ggplot(df_mod_1) + geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) + geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') + geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("No. of deaths") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 3. 2011/12 estimates fitted to 2012/13 season") + theme(axis.text.x=element_text(angle=90, hjust=1)) g_mod_1
Gray - observed deaths
Blue dash - fitted curved for 2011 / 2012, applied to 2012 / 2013
Red dotted - upper 90% prediction interval
In the original paper, deaths exceeding the 90% interval would be considered in excess, and if two consecutive weeks occurred than that was a 'epidemic'.
However, it was noted in the paper this was less practical when consecutive years occur without a very low endemicity flu season. An alternative approach was given that is more generalizable.
One approach is to estimate the model using the previous 5 years of data. Assuming this will average out epidemics and give a baseline measure of activity.
df_cdc_2 <- df_cdc %>% dplyr::filter(fluyr<2014) %>% mutate(t = row_number(), #set origin to october theta = 2*t / 52, sin_2 = sinpi(theta), cos_2 = cospi(theta)) mod_1_sec_fit <- df_cdc_2 %>% lm(fludeaths ~ t + sin_2 + cos_2, #note the 0 term data=., na.action = na.exclude) df_pred <- predict(mod_1_sec_fit, newdata = df_cdc, se.fit=TRUE, interval="prediction", level=0.90) pred_yhat_0 <- df_pred$fit[,1] #fitted values pred_yhat_upr <- df_pred$fit[,3] df_mod_1_sec <- df_cdc %>% add_column(., pred_yhat_0, pred_yhat_upr)
g_mod_1 <- ggplot(df_mod_1_sec[df_mod_1_sec$fluyr<2014, ]) + geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) + geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') + geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("No. of deaths") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 3. 2010 - 2014 estimates fitted") + theme(axis.text.x=element_text(angle=90, hjust=1)) g_mod_1
Model fit from first four seasons, notice the 2012 / 2013 season does not fit the curve well.
g_mod_1 <- ggplot(df_mod_1_sec[df_mod_1_sec$fluyr==2014, ]) + geom_line(aes(x=yrweek_dt, y=fludeaths), size=0.8, alpha=0.5) + geom_line(aes(x=yrweek_dt, y=pred_yhat_0), linetype='dashed', size=1, color = 'blue') + geom_line(aes(x=yrweek_dt, y=pred_yhat_upr), linetype='dotted', color='red', size=1) + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("No. of deaths") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 3. 2010 - 2014 estimates fitted to 2014/2015 season") + theme(axis.text.x=element_text(angle=90, hjust=1)) g_mod_1
Gray - observed deaths
Blue dash - fitted curved for 2011 / 2012, applied to 2012 / 2013
Red dotted - upper 90% prediction interval
df_rpt <- df_mod_1_sec %>% dplyr::filter(fluyr==2014) %>% select(fluyr, week, fludeaths, pred_yhat_0, pred_yhat_upr) %>% rowwise() %>% mutate(excess_deaths = max(fludeaths - pred_yhat_upr, 0)) %>% round(., 0) head(df_rpt)
Most weeks no excess deaths are reported, until the month of January when an epidemic occurs.
total <- formatC(sum(df_rpt$fludeaths), format = 'd', big.mark = ',') total_excess <- formatC(sum(df_rpt$excess_deaths), format = 'd', big.mark = ',')
The total reported deaths for the 2014 - 2015 influenza season were r total
but only r total_excess
are deemed in excess of expected based the 90% upper prediction interval for the cyclical regression.
## Add fourier term fit <- fludta %>% dplyr::filter(year>=2010) %>% mutate(week2 = row_number(), theta = 2*week2/52, sin_f1 = sinpi(theta), cos_f1 = cospi(theta), pred_y = predict( lm(perc_fludeaths ~ week2 + sin_f1 + cos_f1, data=.)))
g1 <- ggplot(fit, aes(x=yrweek_dt)) + geom_line(aes(y=perc_fludeaths), size=0.8, colour="#CC0000") + geom_line(aes(y=pred_y), size=0.8, colour="black") + scale_x_date(labels = date_format("%Y"), date_breaks="1 year", expand=c(0, .9)) + xlab("Year") + ylab("% of Deaths from P&I") + theme_light(base_size=16) + theme(plot.title = element_text(size=14)) + labs(title="Figure 3. Pneumonia and Influenza Mortality", caption="Serfling Model") g1
A single Fourier term does a reasonable job approximating the secular trend of disease, except notably during severe influenza epidemics. The black solid line represents a single Fourier term.
During certain seasons, there is a pronounced spike above the predicted (fit) line. For example in the 12-13, 13-14 and 14-15 seasons. These can be subjectively considered severe influenza epidemics. However a goal could be to define objective criteria for epidemics. Such as any time-period 1.64 standard deviations above the fitted line representing a time of severe influenza morbidity and mortality, requiring significant public health intervention.
By default R's predict.lm will fit a 95% prediction interval. In Serfling's original paper it states:
"The epidemic threshold...is placed at a distance of 1.64 standard deviations above the trend line, a level which experience has shown to be useful for distinguishing epidemic increase from random variation."
The CDC also describes the threshold as:
"The seasonal baseline of P&I deaths is calculated using a periodic regression model that incorporates a robust regression procedure applied to data from the previous five years. An increase of 1.645 standard deviations > above the seasonal baseline of P&I deaths is considered the 'epidemic threshold,' i.e., the point at which the observed proportion of deaths attributed to pneumonia or influenza was significantly higher than would be expected at that time of the year in the absence of substantial influenza-related mortality."
fluserf
follows this method, but the modified serfling function allows for more options for threshold prediction. 1.645 standard deviations is equivalents to a one-sided upper limit 95% confidence interval.
##The following command completes the above steps ##fit serfling model flu_fit <- fluserf(data=df_cdc, outc=perc_fludeaths, time=yrweek_dt) head(flu_fit)
fluplot(flu_fit, xvar=yrweek_dt, perc_fludeaths, y0, y0_ul, ylab="% Deaths from P&I", title="Serfling Model")
The historical Serfling model should be considered as important background and an useful educational tool. But probably not applied in an research project.
The authors view the approach as a pragmatic one in the context of the 1960s when computation was difficult, but its implementation is now somewhat dated given modern computing methods. The constrains of needing an easy to estimate linear model are no longer relevant to the modern analyst. The use of a Fourier term to obtain fitted estimates and coefficients for "off-season" timepoints, then construct fitted lines for "on-season" timepoints makes many untested assumptions about the functional form and fixed parameters of seasonal influenza trend lines.
More elegant models (e.g. ARIMA, splines) are now available which may overcame the above limitations, and make fewer assumptions about the functional form.
Additionally the approach is dependent on an accurate baseline period. Whether to include mild or known epidemic seasons, which cut-off to define as the influenza season (e.g. week 40 - week 20) are important but subjective decisions the analyst must make.
The selected threshold for what constitutes "excess" is somewhat arbitrary. The original Serfling paper describes 1.64 standard deviations for 2 or more weeks as criteria. Recent papers have used the 95% prediction interval.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.