Introduction to the pifpaf package: Estimating Potential Impact and Population Attributable Fractions from Cross-Sectional data
In pifpaf: Potential Impact Fraction and Population Attributable Fraction for Cross-Sectional Data

require(pifpaf)
require(ggplot2)
set.seed(3256)

Introduction

The Potential Impact Fraction (PIF) quantifies the contribution of risk-factor exposure to either morbidity (or mortality). In particular, it compares the observed burden of disease (or death) with a hypothetical counterfactual scenario. PIF is usually defined [@murray2003comparative; @vander2004estimating] for some exposure $X\in \mathbb{R}^p$ with parametrical Relative Risk $RR(X;\theta)$ with parameter $\theta$, and counterfactual function $\textrm{cft}$. If $X$ is categorical (discrete) then

\begin{equation} \textrm{PIF} = \frac{\sum_{i=1}^m P_i \cdot RR(X_i;\theta) - \sum_{i=1}^m P_i \cdot RR\big(\textrm{cft}(X_i);\theta\big)}{\sum_{i=1}^m P_i \cdot RR(X_i;\theta)}, \end{equation}

and if $X$ is continuous:

\begin{equation} \textrm{PIF} = \frac{\int_{\mathbb{R}^p} RR(X;\theta)f(X)dX - \int_{\mathbb{R}^p} RR\big(\textrm{cft}(X);\theta \big)f(X)dX}{\int_{\mathbb{R}^p} RR(X;\theta)f(X)dX}. \end{equation}

In the aforementioned equations $P_i$ represents the probability of $X$ being at the $i$-th category and $f$ the density function of $X$.

Some examples of Relative Risk functions include [@barendregt2010categorical]:

Linear: $RR(X;\theta) = 1 + \theta X$,
Exponential: $RR(X;\theta) = e^{\theta X}$,
Quadratic: $RR(X;\theta_1, \theta_2) = 1 + \theta_1 X + \theta_2 X^2$.

We remark that in both discrete and continuous cases, when the counterfactual is that of the "theoretical-minimum-risk-exposure" (i.e. the counterfactual corresponds to a Relative Risk of $1$) the PIF is equivalent to the Population Attributable Fraction (PAF) defined as:

\begin{equation} \textrm{PAF} = \begin{cases} \frac{\sum_{i=1}^m P_i \cdot RR(X_i;\theta) - 1}{\sum_{i=1}^m P_i \cdot RR(X_i;\theta)} & \textrm{if } X \textrm{ is categorical}, \ \ \frac{\int_{\mathbb{R}^p} RR(X;\theta)f(X)dX - 1}{\int_{\mathbb{R}^p} RR(X;\theta)f(X)dX} & \textrm{if } X \textrm{ is continuous}. \ \end{cases} \end{equation}

In this document we present the pifpaf package which allows the estimation of both PIF and PAF when using information from cross-sectional data. This document is structured as follows:

We present a "quick start" guide if you really cannot resist the urge of using the package.
We briefly discuss the methods involved and their implementations.
We dive into deeper examples concerning more advanced options of our package so you are able to use it like a pro. These include estimation methods, confidence interval methods and plots.
We show some common error messages and how to fix them.

Quick start

The basic ingredients for using the package are:

A Relative Risk function of the exposure $RR(X;\theta)$ which might depends on a parameter $\theta$.
A sample $X_{(n)}$ of $n$ exposure values (or if it is not available, at least estimated mean $m$ and variance $s^2$ of the exposure).
A consistent and asymptotically normal estimator $\hat{\theta}$ of $\theta$.
A function $\textrm{cft}$ that transforms the observed exposure into that of a counterfactual scenario.

If you have those ingredients you are ready to start with our examples which include: a complete sample of the exposure with continuous Relative Risks, a complete sample of the exposure with categorical Relative Risks, only mean and variance of exposure with continuous Relative Risks, and only mean and variance of exposure with categorical Relative Risks.

Continuous RR example: ozone exposure

This example aims to estimate the PIF and PAF of ozone on children's lung growth. The airquality dataset (included in R) has information on ozone levels (ppb) for New York City.

require(datasets)
ozone_exposure  <- na.omit(airquality$Ozone) 
ozone_exposure  <- as.data.frame(ozone_exposure)

Furthermore, assume normalized sampling weights for ozone exposure are given by:

sampling_weights <- c(rep(1/232, 58), rep(0.75/58, 58))

Suppose the Relative Risk of reduced lung growth given exposure is defined by: \begin{equation} RR(X;\theta) = e^{\theta X/5}, \end{equation} where $\theta$ is estimated by $\hat{\theta} = 0.17$ with variance $\sigma_\theta^2 = 0.00025$:

thetahat <- 0.17
thetavar <- 0.00025

We can code the Relative Risk function as:

rr <- function(X, theta){ exp(theta*X/5) }

Notice that the parameters should be $X$ and $\theta$ in that order. Never forget this! Now we are ready to estimate the Population Attributable Fraction:

paf(X = ozone_exposure, thetahat = thetahat, rr = rr, weights = sampling_weights)

We can estimate the Potential Impact Fraction provided we have a counterfactual. Let's assume we want to scale exposure to ozone in 50\% and reduce it by $1$ ppb. The counterfactual function is:

cft <- function(X){0.5*X - 1}

Notice that the counterfactual is solely a function of the exposure $X$. We are now ready to compute the Potential Impact Fraction:

pif(X = ozone_exposure, thetahat = thetahat, rr = rr, cft = cft, weights = sampling_weights)

No study is complete without confidence intervals. Let's calculate the confidence intervals for both PAF and PIF:

paf.confidence(X = ozone_exposure, thetahat = thetahat, thetavar = thetavar, rr = rr, weights = sampling_weights, nsim = 200)

pif.confidence(X = ozone_exposure, thetahat = thetahat, thetavar = thetavar, rr = rr, cft = cft, weights = sampling_weights, nsim = 200)

Several plots are available to enrich our study. We can plot the effect of the counterfactual:

counterfactual.plot(X = ozone_exposure, cft = cft, weights = sampling_weights, n=250)

We can also conduct several sensitivity analysis:

To see how the PAF changes as $\theta$ changes between $0$ and $1/\pi$ (notice that it is nonlinear).

paf.plot(X = ozone_exposure, thetalow = 0, thetaup = 1/pi, rr = rr, weights = sampling_weights, mpoints = 25, nsim = 15)

The same plot is available for the PIF:

pif.plot(X = ozone_exposure, thetalow = 0, thetaup = 1/pi, rr = rr, cft = cft, weights = sampling_weights, mpoints = 25, nsim = 15)

To evaluate the robustness of the PAF (i.e. would it change much for a different sample?).

paf.sensitivity(X = ozone_exposure, thetahat = thetahat, rr = rr, weights = sampling_weights, nsim = 10, mremove = 20)

The same can be done for the PIF:

pif.sensitivity(X = ozone_exposure, thetahat = thetahat, rr = rr, weights = sampling_weights, nsim = 10, mremove = 20)

To conduct a sensitivity analysis on how the PIF changes as the parameters of the counterfactual change. Notice that in order to use this function we need to specify in the definition of counterfactual the parameters involved (maximum 2):

#Change the counterfactual function to specify the parameters involved
cft_sensitivity <- function(X, a, b){a*X - b}

We can also specify the range at which we will change the counterfactual's parameters. For this example, let's change $b$ from $0$ to $1$ and $a$ from $0.5$ to $0.75$.

#Do the sensitivity analysis
pif.heatmap(X = ozone_exposure, thetahat = thetahat, rr = rr, cft = cft_sensitivity, mina = 0.5, maxa = 0.75, minb = 0, maxb = 1, weights = sampling_weights, nmesh = 5)

Discrete RR example: Tobacco consumption

In this example we will compute the PIF and PAF of tobacco consumption over oesophageal cancer. For that purpose we will use the esoph dataset included in R.

require(datasets)
tobacco_consumption <- as.data.frame(esoph$tobgp)

This variable contains categorical information on the number of grams/day of tobacco consumed. Assume the Relative Risk function is given by: \begin{equation} RR(X;\theta) = \begin{cases} \theta_1 & \textrm{ if consumption is } 0-9 \textrm{ g/day},\ \theta_2 & \textrm{ if consumption is } 10-19, \ \theta_3 & \textrm{ if consumption is } 20-29, \ \theta_4 & \textrm{ if consumption is } 30_{+}. \ \end{cases} \end{equation} with estimators $\hat{\theta}_1 = 1$, $\hat{\theta}_2 = 1.59$, $\hat{\theta}_3 = 2.57$, $\hat{\theta}_4 = 4.11$ of the respective $\theta$s. This can be programmed in R as follows:

#Thetas
thetahat <- c(1, 1.59, 2.57, 4.11)

#Relative Risk
rr       <- function(X, theta){

  #Create empty vector to fill with RR's
  r_risk <- rep(NA, nrow(X))

  #Select by cases
  r_risk[which(X == "0-9g/day")] <- theta[1]
  r_risk[which(X == "10-19")]    <- theta[2]
  r_risk[which(X == "20-29")]    <- theta[3]
  r_risk[which(X == "30+")]      <- theta[4]

  return(r_risk)
}

Notice that the Relative Risk assumes the exposure $X$ is a data.frame with each row representing an individual. We can estimate the Population Attributable Fraction:

paf(tobacco_consumption, thetahat, rr)

Consider the counterfactual scenario where smokers in the categories $20-29$ and $30_{+}$ reduce their consumption to the $10-19$ category. This can be coded as:

cft <- function(X){

  #Create empty matrix to fill with RR's
  new_tobacco <- matrix(NA, nrow = nrow(X), ncol = 1)

  #Select by cases
  new_tobacco[which(X == "0-9g/day")] <- "0-9g/day"  #These remain
  new_tobacco[which(X == "10-19")]    <- "10-19"     #the same
  new_tobacco[which(X == "20-29")]    <- "10-19"
  new_tobacco[which(X == "30+")]      <- "10-19"

  # X in relative risk is received as a data.frame
  new_tobacco <- as.data.frame(new_tobacco)
  return(new_tobacco)
}

The Potential Impact Fraction is given by:

pif(tobacco_consumption, thetahat, rr, cft)

In order to compute confidence intervals, assume the following covariance matrix of $\hat{\theta}$: \begin{equation} \Sigma_{\theta} = \left( \begin{array}{cccc} 0.119 & 0 & 0 & 0 \ 0 & 0.041 & 0 & 0 \ 0 & 0 & 0.001 & 0 \ 0 & 0 & 0 & 0.093 \end{array} \right) \end{equation}

which in R is:

thetavar <- diag(c(0.119, 0.041, 0.001, 0.093))

The confidence interval for the PAF is:

paf.confidence(X = tobacco_consumption, thetahat = thetahat, thetavar = thetavar, rr = rr,  confidence_method = "bootstrap", nsim = 200)

The confidence interval for the Potential Impact Fraction is given by:

pif.confidence(X = tobacco_consumption, thetahat = thetahat, thetavar = thetavar, rr = rr, cft = cft, confidence_method = "bootstrap", nsim = 200)

We remark that "bootstrap" is the only confidence_method designed for categorical relative risks.

The counterfactual.plot function produces an appropriate plot for the discrete exposure:

counterfactual.plot(tobacco_consumption, cft)

A sensitivity analysis to evaluate both PAF and PIF's robustness is available:

paf.sensitivity(tobacco_consumption, thetahat=thetahat, rr, 
                nsim = 10, mremove = 20)

pif.sensitivity(tobacco_consumption, thetahat=thetahat, rr=rr, cft = cft,
                nsim = 10, mremove = 20)

Incomplete data Continuous RR example: Systolic Blood Pressure

Consider the following data on Systolic Blood Pressure (SBP measured in mmHg) in females aged 30-44 by world region from [@lawes2006blood]:

sbp <- data.frame("Region"   = c("Afr D", "Afr E", "Amr A", "Amr B", "Amr D",
                                 "Emr B", "Emr D", "Eur A", "Eur B", "Eur C",
                                 "Sear B", "Sear D", "Wpr A", "Wpr B"), 
                  "SBP_mean" = c(123, 121, 114, 115, 117, 126, 121, 
                                 122, 122, 125, 120, 117, 120, 115),
                  "SBP_sd"   = c(20, 13, 14, 15, 15, 15, 15, 
                                 15, 16, 17, 15, 14, 15, 16))

Furthermore, consider the Relative Risk of mortality given SBP as: \begin{equation} RR(SBP; \theta) = 1 + \theta (SBP - 115)^2/121 \end{equation} with $\hat{\theta} = 0.71$ estimator of $\theta$ with estimated variance $s^2 = 0.002$. In R this is given by:

thetahat <- 0.71
thetavar <- 0.002

#Notice that the theoretical minimum risk value is 115 and not 0
rr       <- function(X, theta){ theta*(X - 115)^2/121 + 1}

In this case, only mean and standard deviation information is available for each region. Terrible calamity! However the pifpaf package is prepared for such cases and the "approximate" method is in order. For example, let's calculate the Population Attributable Fraction for the "Afr E" region:

#Get mean and variance
afr_mean <- as.data.frame(subset(sbp, Region == "Afr E")$SBP_mean)
afr_var  <- subset(sbp, Region == "Afr E")$SBP_sd^2 

#Calculate paf using approximate method
paf(X = afr_mean, thetahat = thetahat, rr = rr, method = "approximate", Xvar = afr_var, check_rr = FALSE)

We can also compute confidence intervals:

paf.confidence(X = afr_mean, thetahat = thetahat,  thetavar = thetavar, rr = rr, method = "approximate", Xvar = afr_var, check_rr = FALSE, nsim = 200)

A counterfactual of reducing the overall SBP in 5 mmHg is given by:

cft <- function(X){X - 5}

The Potential Impact Fraction translates into:

pif(X = afr_mean, thetahat = thetahat, rr = rr, cft = cft, method = "approximate", Xvar = afr_var, check_rr = FALSE)

with confidence interval:

pif.confidence(X = afr_mean, thetahat = thetahat, rr = rr, cft = cft, method = "approximate", Xvar = afr_var, check_rr = FALSE, thetavar = thetavar, nsim = 200)

We can plot how PAF (and PIF) estimates change as functions of $\theta$:

paf.plot(X = afr_mean, thetalow = 0, thetaup = 1, rr = rr, method = "approximate", Xvar = afr_var,  check_rr = FALSE, mpoints = 25, nsim = 15)

pif.plot(X = afr_mean, thetalow = 0, thetaup = 1, rr = rr, cft = cft, method = "approximate", Xvar = afr_var,  check_rr = FALSE, mpoints = 25, nsim = 15)

Incomplete data Categorical RR example: Body Mass Index

Consider the Relative Risk of dying associated to Body Mass Index (BMI) is given by: \begin{equation} RR(BMI;\theta)=\begin{cases} 1 & \textrm{if } BMI<25,\ 1.3\times BMI/25 & \textrm{if } 25\leq BMI<30,\ e^{0.62\times BMI/30} & \textrm{if } BMI>30. \end{cases} \end{equation}

Assume only the proportion of individuals in each category is known as well as the per-category mean and variance:

problem_data <- data.frame(Proportions   = c(  0.56,       0.21,       0.23),
                           Mean          = c(  23.2,       27.1,       31.9),
                           Variance      = c(  1.00,       0.87,       1.12))

rownames(problem_data) <- c("Normal", "Overweight", "Obese")

The approximate method as used in the previous example cannot be directly used as the Relative Risk function is non-differentiable (i.e. it is defined "by parts"). However we can compute the PAF for each category (Normal, Overweight and Obese) and then combine them. For that purpose, we define the Relative Risks for each category:

rr_normal     <- function(X, theta){theta}
rr_overweight <- function(X, theta){theta*X/25}
rr_obese      <- function(X, theta){exp(theta*X/30)}

and then compute the PAFs:

#Subpopulation PAF
paf_normal     <- paf(as.data.frame(problem_data["Normal","Mean"]), 1.00, rr = rr_normal, check_rr = FALSE, method = "approximate", Xvar = problem_data["Normal","Variance"])
paf_overweight <- paf(as.data.frame(problem_data["Overweight","Mean"]), 1.39, rr = rr_overweight, check_rr = FALSE, method = "approximate", Xvar = problem_data["Overweight","Variance"])
paf_obese      <- paf(as.data.frame(problem_data["Obese","Mean"]), 0.62, rr = rr_obese, check_rr = FALSE, method = "approximate", Xvar = problem_data["Obese","Variance"])

Finally the PAFs can be combined into the population PAF:

#Population PAF
paf.combine(c(paf_normal, paf_overweight, paf_obese), problem_data$Proportions)

If pifs are estimated you can use the pif.combine function. Notice that in this case, no confidence intervals are available as no information on the correlation between the BMI categories is assumed.

Methods

A broader definition of the Potential Impact Fraction (which includes both cases presented in the Introduction) is given by: \begin{equation} \textrm{PIF} = 1 - \frac{E_{X}\left[RR\big(\textrm{cft}(X);\theta\big)\right] }{E_{X}\left[RR\big(X; \theta\big)\right]}, \end{equation} where $\textrm{cft}(X)$ denotes the counterfactual transform of the exposure $X$, $RR$ the relative risk function with parameter $\theta$. Note that the PAF is a special case of the $\textrm{PIF}$ when the counterfactual scenario corresponds to the one of the theoretical minimum risk exposure ($RR=1$). We have developed three methods of estimation: empirical, kernel and approximate.

The Empirical Potential Impact Fraction

Assume a Relative Risk $RR:\mathcal{X} \times \Theta \to I \subseteq (0,\infty)$ for exposure $X$ and with parameter $\theta$. Let $X_1, X_2, \dots, X_n$ be a random sample of exposure and covariates $X\in\mathcal{X}\subset\mathbb{R}^p$ with normalized sampling weights $w_1, w_2, \dots, w_n$ and $\hat{\theta} \in \Theta \subseteq \mathbb{R}^q$ estimator of $\theta$ with $\Theta, \mathcal{X}$ compact sets. Define the functions:

\begin{equation} \hat{\mu}n^{\textrm{obs}}(\theta) = \sum\limits{i=1}^{n} w_i RR\big( X_i; \theta \big), \quad \textrm{and} \quad \hat{\mu}n^{\textrm{cft}}(\theta) = \sum\limits{i=1}^{n} w_i RR\big( \textrm{cft}(X_i); \theta \big), \end{equation} then:

\begin{equation}\label{pafestimate} \widehat{\textrm{PIF}} = 1 - \dfrac{\hat{\mu}_n^{\textrm{cft}}(\hat{\theta})}{\hat{\mu}_n^{\textrm{obs}}(\hat{\theta})}, \qquad \textrm{and} \qquad \widehat{\textrm{PAF}} = 1 - \dfrac{1}{\hat{\mu}_n^{\textrm{obs}}(\hat{\theta})} \end{equation} are Fisher-consistent estimators of the PIF and the PAF if $\hat{\theta}$ is Fisher-consistent. Furthermore if the Relative Risk $RR$ is either convex, concave or Lipschitz continuous as a function of $\theta$ and $\hat{\theta}$ is (asymptotically) consistent the estimators have asymptotic consistency.

Kernel Based Potential Impact Fraction

Define the Relative Risk $RR:\mathcal{X} \times \Theta \to I \subset (0,\infty)$ (the additional hypotheses used for the empirical method are not necessary). Let $\hat{f}$ denote a kernel density obtained from the random sample of $X\in\mathcal{X}\subseteq\mathbb{R}^p$. Let $\hat{\theta} \in \Theta \subset \mathbb{R}^q$ be a consistent estimator of $\theta$. We define the functions:

\begin{equation} \hat{\nu}n^{\textrm{obs}}(\theta) = \int\limits{\mathbb{R}^p} RR( x; \theta)\hat{f}(x)dx, \quad \textrm{and} \quad \hat{\nu}n^{\textrm{cft}}(\theta) = \int\limits{\mathbb{R}^p} RR\big( \textrm{cft}(x); \theta\big)\hat{f}(x)dx, \end{equation} then:

\begin{equation} \widehat{\textrm{PIF}} = 1 - \frac{\hat{\nu}_n^{\textrm{cft}}(\hat{\theta})}{\hat{\nu}_n^{\textrm{obs}}(\hat{\theta})} \end{equation} is a consistent estimator of the Potential Impact Fraction ($\textrm{PIF}$).

Approximate Empirical Potential Impact Fraction

Sometimes researchers do not have a random sample of the exposure $X$; nevertheless, they possess $m$, $s^2$ estimators of the exposure's mean and variance (respectively). Furthermore, assume that for each $\theta \in \Theta$ the Relative Risk function $RR(\cdot, \theta)$ has a second order Taylor Expansion for all $X \in \mathcal{X}$ and that the counterfactual function is twice differentiable. An approximate point estimate for the PIF is given by the Laplace approximation:

\begin{equation} \widehat{\textrm{PIF}}= 1-\frac{RR\big(\textrm{cft}(m),\theta\big) + \frac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \textrm{Cov}(X_i,X_j)\frac{\partial^2 RR\big(\textrm{cft}(X),\theta\big)}{\partial X_i \partial X_j}\Big|m}{RR(m;\hat{\theta})+\frac{1}{2} \sum{i=1}^n\sum_{j=1}^n \textrm{Cov}(X_i,X_j)\frac{\partial^2 RR\big(X,\theta\big)}{\partial X_i \partial X_j}\Big|_m}. \end{equation}

The approximate method solely requires the sample mean $m$ and variance $s^2$, not the whole sample. If the sample is available, the other methods should be preferred.

Methods in code

All methods have been coded in the method option of the functions paf, pif and related. That is: we can estimate the PIF by different methods specifying the type:

#Data
set.seed(2374)
X        <- as.data.frame(rlnorm(100))
rr       <- function(X, theta){theta*X + 1}
cft      <- function(X){sqrt(X + 1)}
thetahat <- 0.1943

#Empirical
pif(X, thetahat, rr, cft, method = "empirical")

#Kernel
pif(X, thetahat, rr, cft, method = "kernel")

#Approximate
meanX <- as.data.frame(mean(X[,1]))
pif(meanX, thetahat, rr, cft, method = "approximate", Xvar = var(X))

Note that for the approximate method the correct input is mean and variance of X. If no method is specified in pif(X, thetahat, rr, cft) the "empirical" is chosen.

Confidence Intervals

The "bootstrap" confidence method is the recommended method to calculate confidence intervals while using the kernel and empirical point estimate. However this method cannot be used for the approximate point estimate, since only mean and variance are available. Therefore other methods such as the "linear"´ and"loglinear"were developed to calculate the confidence interval of the PIF. The"inverse"and"one2one"`` methods can be used for some cases of the PAF resulting in additional precision. All, but the one to one method, consider $\hat{\theta}$ to be a consistent estimator of $\theta$ such that it is asymptotically normal with mean $\theta$ and variance $\sigma_{\theta}^2$ where $\hat{\sigma}{\theta}^2$ is an estimator of $\sigma{\theta}^2$. The following table shows the methods to estimate confidence intervals and when each of the methods can be used.

|Confidence Interval |Point Estimate|PAF or PIF |Extra Assumptions| |-----------|:----------------------:|:---------:|:---------------------------:| |Bootstrap |Empirical & Kernel | PIF & PAF |None| |Linear |Empirical & Appproximate| PIF & PAF |None| |Loglinear |Empirical & Appproximate| PIF & PAF |None| |Inverse |Empirical & Appproximate| PAF |None| |One to one |Empirical & Appproximate| PAF |$E_X\big[RR(X,\theta)\big]$ is injective in $\theta$|

Remember that to calculate the point estimate in the case of the approximate method the relative risk function $RR(X,\theta)$ and the counterfactual function $\textrm{cft}(X)$ must be continuously differentiable in terms of $X$. To get the confidence intervals of the PIF we calculate the variance of PIF (or of a transformation $f(\textrm{PIF})$). Notice that uncertainty comes from two sources: the exposure $X$ and the Relative Risk's parameter $\theta$. The estimation process is done in three steps for the methods: linear, loglinear, and inverse:

We estimate the variance and expected value of $f(\textrm{PIF})$ as a function of $\theta$.
We use the fact that $\theta$ is asymptotically normal to estimate the unconditional variance:

\begin{equation}\label{conditioningvar} \textrm{Var} \left(A \right) = E_{B} \left[ \textrm{Var} \left( A \left. \right| B \right) \right] + \textrm{Var}_{B} \left[ E\left( A \left. \right| B\right) \right]. \end{equation}

We assume asymptotic normality of $f(\widehat{\textrm{PIF}})$ to estimate confidence intervals.

Further explanation of each of the methods is given below.

Bootstrap

Bootstrap consists on resampling with replacement several times from a given random sample. In this case from the random sample of exposure values $X_1,X_2,\cdots X_n$. For each re-sample $X^{j}=X_1^{j}, X_2^j, \cdots, X_n^{j}$ a value $\theta_j$ is simulated from a normal distribution with mean $\hat{\theta}$ and variance $\hat{\sigma}^2_{\theta}$. For each $X^j$ and $\theta_j$, $\widehat{\textrm{PIF}}_j$ is estimated with the selected method (empirical or kernel). From the $\widehat{\textrm{PIF}}_j$s a confidence interval for $\textrm{PIF}$ is calculated using the pivotal method @wasserman2006nonparametric.

Linear

The linear method considers Taylor's first order approximation (linearization) of $\widehat{\textrm{PIF}}$ and the variance for the $\widehat{\textrm{PIF}}$ is calculated as the variance of the linearization. This approach is better known as the Delta Method @casella2002statistical.

Loglinear

The loglinear method uses the $(1-\alpha)\times 100\%$ confidence interval for $\textrm{log}(1-\textrm{PIF})$. The transformation $1-e^{y}$ (a one to one function) ensures that the confidence interval is at least $(1-\alpha)\times 100\%$ @bar1999confidence.

Inverse

The inverse method can be used only for confidence intervals of the $\textrm{PAF}$. The confidence interval ($\textrm{IC}{RR}$) for $E[RR(X,\theta)]$ is calculated and then transformed to a confidence interval of $\textrm{PAF}$ by $1-1/\textrm{IC}{RR}$. Once again (as in the loglinear case) the transformation $1-1/x$ is injective and thus the transform confidence interval of $1 - 1/\textrm{IC}_{RR}$ is at least $(1-\alpha)$ for the $\textrm{PAF}$ @bar1999confidence.

One to one

The one to one method is similar to the inverse method, since the confidence interval $\textrm{IC}{RR}$ is calculated. The difference lies on how uncertainty on $\theta$ is calculated. The inverse method uses simulations of $\theta$, while the one to one method considers the upper and lower bounds of a $1-\beta$ confidence interval of $\theta$ to calculate a $1-\alpha$ confidence interval $\textrm{IC}{RR}$, where $\alpha>\beta$. This method can only be used if $E[RR(X;\theta)]$ is injective in $\theta$ @bar1999confidence.

Extra option: `force.min`

For the inverse and one to one confidence intervals the option force.min is available. This option guarantees that the lower bound of the confidence intervals of the expected relative risk takes values greater or equal to one. This option is not recommended, as in most cases there is uncertainty on whether the relative risk can be less than 1 (albeit with "small" probability). However this option can be useful when one is absolutely certain the relative risk can't be less than one.

Advanced examples

This section is concerned with more advanced options of the pifpaf package functions. We first analyze how to choose an estimation method; secondly, we show how to choose a confidence interval; finally we show how to work with the plots.

Deciding on the estimation method

The previous section discussed the three estimation methods used in the package. In this section we discuss some advanced options as well as how to choose the method.

Method choosing

|Method |Exposure|Relative Risk |$\theta$ Estimator| |------------|:------------------------:|:-------------------------------------:|:----------------------------:| |Kernel |Continuous |Continuous (One dimensional in pifpaf package)|Asymptotically consistent| |Empirical |Continuous or Discrete |Convex, Concave, or Lipschitz|Asymptotically consistent| |Empirical |Continuous or Discrete |Any|Fisher consistent| |Approximate |Only mean and variance |Twice differentiable & Convex, Concave, or Lipschitz |Asymptotically consistent| |Approximate |Only mean and variance |Twice differentiable |Fisher consistent|

Kernel

A kernel density is an approximation to the probability density function of a random variable constructed from the variable's sample. For instance the following image shows the real density for a normally distributed random variable with mean $0$ and variance $1$ as well as the kernel approximation to said density from a sample of size 45.

#Get a random variable distributed normally
X              <- rnorm(45)
Y              <- seq(-3,3, length.out = 100)
real_density   <- data.frame("Density" = dnorm(Y), "Axis" = Y)

#Approximate via kernel
kernel_density <- data.frame("Density" = density(X)$y, "Axis" = density(X)$x)

#Show the density
ggplot() + 
  geom_line(aes(x = Axis, y = Density, color = "Real density"), data = real_density) + 
  geom_line(aes(x = Axis, y = Density, color = "Kernel density"), data = kernel_density) +
  scale_color_manual("Type", values = c("Real density" = "purple",
                                        "Kernel density" = "tomato3")) +
  theme_classic() + xlab("X") + ggtitle("Real and approximate density of normal distribution from sample of size 45")

There are several kernel types that provide different forms of approximation. For example, consider the following sample:

set.seed(46)
X <- rlnorm(25)

Whose density can be approximated via kernels:

#Check kernel types
kernels         <- c("gaussian", "epanechnikov", "rectangular", 
                     "triangular", "biweight", "cosine", "optcosine")
color_k         <- rainbow(length(kernels))
names(color_k)  <- kernels 

#Data frame
kdata           <- data.frame(matrix(NA, 
                                     ncol = length(kernels) + 1, 
                                     nrow = 250))
colnames(kdata) <- c("X", kernels)
kdata$X         <- seq(-3, 8, length.out = 250)


#Create kernel densities
for(ktype in kernels){
    kernel_density  <- density(X, kernel = ktype)
    mat_approximate <- approx(kernel_density$x, kernel_density$y, 
                              kdata$X, rule = 2)
    kdata[ ,ktype]  <- mat_approximate$y
}

#Create plot
kplot  <- ggplot(kdata, aes(x = X)) 
for(ktype in kernels){
  kplot <- kplot + geom_line(aes_string(y = ktype, color = factor(ktype)))
}
kplot + scale_color_manual("Kernel type", values = color_k) + theme_classic() +
  ylab("Density")

Notice that different kernels have different approximations to the sample's density. Henceforth, if we were to estimate the Potential Impact Fraction, different values would result from different kernels:

X        <- as.data.frame(X)
thetahat <- 1
thetavar <- 0.1
rr       <- function(X, theta){theta*X + 1}
cft      <- function(X){X/2}

#Rectangular kernel
pif.confidence(X, thetahat, rr, cft = cft, method = "kernel", ktype = "gaussian", thetavar = thetavar, nsim = 200)

#Gaussian kernel
pif.confidence(X, thetahat, rr, cft = cft, method = "kernel", ktype = "rectangular", thetavar = thetavar, nsim = 200)

Additional kernel options include bandwith, adjustment, and number of interpolation points. These options are taken directly from the density function.

pif.confidence(X, thetahat, rr, cft = cft, method = "kernel", ktype = "rectangular", bw = "nrd", adjust = 2, n = 1000, thetavar = thetavar, nsim = 150)

Approximate

As stated previously, the approximate method should only be used if the only information known to the researcher is sample mean and variance but the sample is not available. The approximate method works with numerical derivatives from the numDeriv package inheriting its options for derivatives. Consider the theoretical function:

rr <- function(X, theta){ theta[1]*X^2/(X + 1) + theta[2]*X + 1}

Assume the following information is available for $X$:

Xmean    <- as.data.frame(0.365)
Xvar     <- 0.25
thetahat <- c(0.32, 1/4)

The approximate PAF is given by:

paf(Xmean, thetahat, rr, Xvar = Xvar, method = "approximate")

Additional options can be changed to improve the derivation method:

paf(Xmean, thetahat, rr, Xvar = Xvar, method = "approximate", deriv.method = "Richardson", deriv.method.args = list(eps=0.03, d=0.0001, zero.tol=1.e-8, r=4, v=2))

Deciding on the confidence interval

By default, confidence intervals for the empirical and kernel methods are bootstrap; for the approximate method default is loglinear. When calculating confidence intervals for the Population Attributable Fraction additional methods are available: "one2one" and "inverse".

Force min

The force.min option of "inverse" confidence method forces the Population Attributable Fraction's interval to be > 0. This option is not recommended as it artificially reduces the uncertainty around estimates.

X <- as.data.frame(rnorm(100))
paf.confidence(X, 0.12, rr = function(X, theta){exp(theta*X)}, thetavar = 0.1, check_exposure = F, confidence_method = "inverse",force.min = FALSE, nsim = 200)

paf.confidence(X, 0.12, rr = function(X, theta){exp(theta*X)}, thetavar = 0.1, check_exposure = F, confidence_method = "inverse", force.min = TRUE, nsim = 200)

However, there might be cases for which such a confidence interval makes sense.

Working with the plots

Plot for different values of $\theta$

The command pif.plot (paf.plot respectively) allows us to analyze how the PIF (resp. PAF) varies as the values of $\theta$ changes:

X        <- as.data.frame(rbeta(100, 1, 3))
rr       <- function(X, theta){theta*X^2 + 1}
cft      <- function(X){X/1.2}
thetalow <- 0
thetaup  <- 5
pif.plot(X = X, thetalow = thetalow, thetaup = thetaup, rr = rr, cft = cft, mpoints = 25, nsim = 15)

Methods can be specified as in pif:

pif.plot(X = X, thetalow = thetalow, thetaup = thetaup, rr = rr, cft = cft, method = "kernel", n = 1000, adjust = 2, ktype = "triangular", confidence_method = "bootstrap", confidence = 99,
         mpoints = 25, nsim = 15)

Plot options include color and label titles:

pif.plot(X = X, thetalow = thetalow, thetaup = thetaup, rr = rr, cft = cft, colors = rainbow(2), xlab = "Exposure to hideous things.", ylab = "PIF PIF PIF!", title = "This analyisis is the best", mpoints = 25, nsim = 15)

pif.plot is a ggplot object and thus one can work with it as one:

#require(ggplot2)
pif.plot(X = X, thetalow = thetalow, thetaup = thetaup, rr = rr, cft = cft, colors = rainbow(2),
         mpoints = 25, nsim = 15) + theme_dark()

Sensitivity Analysis

The command pif.sensitivity (paf.sensitivity respectively) allows us to analyze how our estimates for the PIF (resp. PAF) would vary if we excluded some part of the exposure sample, the usage would be the following:

#Get sample
X        <- as.data.frame(sample(c("Exposed","Very exposed","Unexposed"), 540, 
                   replace = TRUE, prob = c(0.25, 0.05, 0.7)))

#Theta values
thetahat <- c(1.2, 7)

#RR defined for each category
rr       <- function(X, theta){
  Xnew <- matrix(1, ncol = ncol(X), nrow = nrow(X))
  Xnew[which(X[,1] == "Exposed"),1]      <- theta[1]
  Xnew[which(X[,1] == "Very exposed"),1] <- theta[2]
  return(Xnew)
}

#Counterfactual of stopping the very exposed
cft      <- function(X){
  Xcft                                   <- X
  Xcft[which(X[,1] == "Very exposed"),] <- "Unexposed"
  return(Xcft)
}

#Sensitivity analysis takes some time. 
pif.sensitivity(X = X, thetahat = thetahat, rr = rr, cft = cft, mremove = 18, nsim = 10)

The default sensitivity analysis removes mremove elements from the sample X and re-calculates the pif with them nsim times. It is possible to modify those parameters:

pif.sensitivity(X = X, thetahat = thetahat, rr = rr, cft = cft, nsim = 10, mremove = 18)

Plot options can also be modified. Furthermore, they are also ggplot objects!

pif.sensitivity(X = X, thetahat = thetahat, rr = rr, 
                legendtitle = "This is legendary",
                title = "This feels entitled", xlab = "A boring X axis",
                ylab = "A not so boring Y axis",
                nsim = 10, mremove = 18, colors = cm.colors(4)) +
  theme(axis.line = element_line(colour = "purple"))

The sensitivity analysis can only be performed when a sample of exposure values is available, therefore no sensitivity analysis can be made for the PIF (resp. PAF) when only mean and variance of exposure are known.

Heatmap for counterfactuals

To evaluate scenarios a heatmap is a useful tool, for one can have an idea of the possible outcomes from different counterfactuals. The counterfactual scenarios analyzed by default are those with the counterfactual function $\text{cft}(X)=aX+b$, where $a\in[0, 1]$ and $b\in[-1,0]$. These counterfactual scenarios can be plotted as:

X        <- as.data.frame(runif(100, 0, 2*pi) + 1)
rr       <- function(X, theta){return(abs(X*cos(X + thetahat) + 2))}
thetahat <- pi
pif.heatmap(X = X, thetahat = thetahat, rr = rr, 
            check_rr = FALSE, check_integrals = FALSE, nmesh = 5)

Other counterfactual scenarios can be represented. For example, the same counterfactual function can be analyzed for different values of $a$ and $b$. If $a\in[0.5, 1]$ and $b\in[-3,-1]$

mina <- 0.5
maxa <- 1
minb <- -3
maxb <- -1
pif.heatmap(X = X, thetahat = thetahat, rr = rr, mina = mina, 
            maxa = maxa, minb = minb, maxb = maxb, check_rr = FALSE, 
            check_integrals = FALSE, nmesh = 5)

Not only affine counterfactuals can be shown in the heatmap. You can define your own counterfactual! For example $sin(aX + b)$:

#Notice that counterfactual here must be function of a and b
cft <- function(X, a, b){sin(a*X+b)}

#Counterfactual
pif.heatmap(X=X, thetahat = thetahat, rr = rr, 
            mina = mina, maxa = maxa, minb = minb, maxb = maxb,
            cft = cft, check_rr = FALSE, check_integrals = FALSE,  title = "PIF with counterfactual sin(aX+b)", nmesh = 5)

We can also analyze how the counterfactual changes solely as a function of $a$. For that purpose, set minb and maxb to the same value

#Notice that counterfactual here must be function of a and b
cft <- function(X, a, b){sin(a*X+b)}

#Counterfactual
pif.heatmap(X=X, thetahat = thetahat, rr = rr, 
            mina = mina, maxa = maxa, minb = 2, maxb = 2,
            cft = cft, check_rr = FALSE, check_integrals = FALSE, title = "PIF with counterfactual sin(aX+2)", nmesh = 5)

The title (title), axis names (xlab, ylab), colors (colors), and number of squares in grid (nmesh) can also be changed:

pif.heatmap(X=X, thetahat = thetahat, rr = rr,
            nmesh = 5, title = "Twister counterfactual",
            xlab = "This is X", ylab = "This is not X",
            colors = rainbow(5), check_rr = FALSE, check_integrals = FALSE)

Counterfactual plot

The counterfactual.plot function allows the user to plot the effect of the counterfactual scenario over the observed distribution of the exposure X if X is univariate.

#Get the exposure
X   <- as.data.frame(rnorm(1000, 150, 15))
cft <- function(X){0.35*X + 75}  

#Plot!
counterfactual.plot(X, cft)

We can analyze the change of a specific subpopulation by using fill_limits:

#Plot!
counterfactual.plot(X, cft, fill_limits = c(150, Inf))

We can further make changes to the plot's appearance:

#Plot!
plot_cft <- counterfactual.plot(X, cft, fill_limits = c(150, Inf),
                                xlab  = "Usual SBP (mmHg)",
                                ylab  = "Proportion of population (%)",
                                legendtitle = "Distribution",
                                dnames = c("Current","After policy"),
                                title = paste0("Effect of a non-linear hazard function and choice",
                                               "\nof baseline on total population risk", 
                                               "\n(Fig 25 from Vander Hoorn et al)"),
                                fill = TRUE, colors = c("blue","purple")) 
plot_cft

Objects from counterfactual.plot are ggplot objects:

plot_cft + geom_segment(aes(x = 168, y = 0.01, xend = 132, yend = 0.025), 
                        arrow = arrow(length = unit(0.25, "cm")))

The function automatically determines if the input is continuous or discrete:

X   <- data.frame(Exposure = sample(c("Exposed","Unexposed"), 
                      100, replace = TRUE, prob = c(0.3, 0.7)))
cft <- function(X){

     #Find which indivuals are exposed
     exposed    <- which(X[,"Exposure"] == "Exposed")

     #Change 1/3 of exposed to unexposed
     reduced    <- sample(exposed, length(exposed)/3)
     X[reduced,"Exposure"] <- "Unexposed"

     return(X)
}  
counterfactual.plot(X, cft)

In the discrete case one can specify the order of the X-axis:

counterfactual.plot(X, cft, x_axis_order = c("Unexposed", "Exposed"))

One should be careful when including exposure variables $X$ that have been coded as numeric but represent discrete cases as the following example shows. In those cases, the exposure.type option saves the day.

#Same example as before but now exposed has been coded as 1 and unexposed 0
X   <- data.frame( Exposure = sample(c(1,0), 100, 
                    replace = TRUE, prob = c(0.3, 0.7)))

#Same counterfactual considering the new code
cft <- function(X){

     #Find which indivuals are exposed
     exposed    <- which(X[,"Exposure"] == 1)

     #Change 1/3 of exposed to unexposed
     reduced    <- sample(exposed, length(exposed)/3)
     X[reduced, "Exposure"] <- 0

     return(X)
}  

#One should specify exposure is discrete 
counterfactual.plot(X, cft, exposure.type = "discrete")

Common error messages

Unused argument (thetahat)

Error Error in rr(.X0, thetahat) : unused argument (thetahat) states thetahat was not included in the definition of the rr function.

X  <- data.frame(rlnorm(100))
rr <- function(X){X + 1}
paf(X, 0, rr)

The solution is to include it:

X  <- data.frame(rlnorm(100))
rr <- function(X, theta){X + 1}
paf(X, 0, rr)

Relative Risk must equal 1 when evaluated in 0

Warning Relative Risk by definition must equal 1 when evaluated in 0. Are you using displaced RRs? establishes the Relative Risk is not $1$ when the exposure $X$ is set to $0$. This is just a reminder to avoid careless definitions of Relative Risk functions; however it is not always the case that exposure ought to be $0$ as the Systolic Blood Pressure example establishes. If that is the case, set check_rr to FALSE:

pif(data.frame(rlnorm(100)), 0, function(X, thetahat){X}, check_rr = FALSE)

Some exposure values are less than zero

Warning Some exposure values are less than zero, verify this is correct. establishes there are negative elements of the exposure $X$. Due to the physical interpretation of exposure it might not make sense to have negative values. However, if your measurement of exposure includes negative values you can stop that message with check_exposure to FALSE.

paf(data.frame(runif(100, -1, 1)), 0, rr = function(X, theta){exp(X)}, check_exposure = FALSE)

Counterfactual is increasing the Risk

Warning Counterfactual is increasing the Risk. Are you sure you are specifying it correctly? establishes that under current counterfactual the Relative Risk associated to that exposure is increasing (and not decreasing as the usual practice of setting counterfactuals that reduce Risks would suggest). To dismiss the warning, set check_integrals = FALSE.

pif(data.frame(rbeta(100, 2, 3)), 0, rr = function(X, theta){exp(X)}, cft = function(X){2*X}, check_integrals = FALSE)

Expected value is not finite

Both pif and paf are, by definition, expected values. As such, there are certain distributions for which the theoretical expected values do not exist. The warnings Expected value of Relative Risk is not finite, and Expected value of Relative Risk under counterfactual is not finite establish that the estimator of the expected value of the Relative Risk of the exposure (or the Relative Risk over the counterfactual exposure) is (for all computational purposes) infinite.

paf(data.frame(rlnorm(100, 23, 12)), 1, rr = function(X, theta){exp(theta*X)})

If both the Relative Risk of the observed exposure and the one over the counterfactual exposure are infinite then PIF is undefined:

pif(as.data.frame(rlnorm(100, 23, 12)), 1, rr = function(X, theta){exp(theta*X)}, cft = function(X){X/2})

Confidence intervals might also be undefined in those cases:

paf.confidence(as.data.frame(rlnorm(100, 23, 12)), 1, rr = function(X, theta){exp(theta*X)}, 
               thetavar = 0.2, confidence_method = "inverse", nsim = 50)

or might be useless:

paf.confidence(as.data.frame(rlnorm(100, 23, 12)), 1, rr = function(X, theta){exp(theta*X)}, 
               thetavar = 0.2,confidence_method = "linear", nsim = 30)

Error in weighted.mean

Error Error in weighted.mean.default(rr(.X, thetahat), weights) : 'x' and 'w' must have the same length establishes that the weighted.mean of the relative risk evaluated at the exposure rr(X, thetahat) using the weights weights cannot be estimated. This may be due to several causes:

#Survey weights have different length than exposure
X <- data.frame(runif(100))
w <- rep(1/12, 12)
pif(X, 0.12, rr = function(X, theta){theta*X + 1}, weights = w)

#The Relative Risk function has a different length than exposure X
X  <- data.frame(runif(100))
rr <- function(X, theta){1}
pif(X, 8, rr)

#The counterfactual function result might have a different length than exposure X
X   <- as.data.frame(runif(100))
rr  <- function(X, theta){X + 1}
cft <- function(X){2}
pif(X, 8, rr = rr, cft = cft)

Hessian might not be defined

Error message: Hessian might not be defined for those values of rr states that it was impossible to numerically estimate the Hessian for the approximate expected value of the Relative Risk. There are two main reasons for this: 1) values of the rr are extremely large or 2) the relative risk function is not differentiable.

#Extremely large values of rr
pif(as.data.frame(2.61573e+22), 1, rr = function(X, theta){exp(theta*X)}, cft = function(X){X/2},
    Xvar = 100, method = "approximate")

#rr not differentiable at 0
rr <- function(X, theta){
  sqrt(X)
}
paf(as.data.frame(0), 1, rr = rr, method = "approximate", Xvar = 1, check_rr = FALSE)

Under this density some values are NA

Warning message Under this kernel density some values of cft are NA and Under this kernel density some values of rr are NA indicate that the adjusted density is in a domain where the rr or the cft are not defined. For example consider the following rr and X:

X  <- rlnorm(100)
rr <- function(X, theta){sqrt(X) + 1}

By construction, all values of X are positive; however, the "gaussian" kernel density adjusts the density to include some negative values:

densLnorm <- density(X)
dens_data <- data.frame(x = densLnorm$x, y = densLnorm$y)
ggplot(dens_data) + geom_line(aes(x = x, y = y), color = "deepskyblue4", size = 1) + theme_classic() +
  xlab("Exposure X") + ylab("Density") + ggtitle("Gaussian kernel adjusted to exposure data")

When kernel method is applied for paf, the error occurs as it attempts to take square root of the negative values.

paf(as.data.frame(X), 0.12, rr = rr, method = "kernel", ktype = "gaussian")

There are two ways of fixing this: using the empirical method or changing the Relative Risk function:

#Using empirical method
paf(data.frame(X), 0.12, rr = rr, method = "empirical")

#Rewriting RR function
rr <- function(X, theta){
  Xnew            <- as.data.frame(rep(0, nrow(X)))
  Xnew[which(X[,1] >= 0),1] <- sqrt(X[which(X[,1] >= 0),1] ) + 1
  return(Xnew)
}
paf(data.frame(X), 0.12, rr = rr, method = "kernel", ktype = "gaussian")
pif.kernel(as.data.frame(X), 0.12, rr)

References

Any scripts or data that you put into this service are public.

pifpaf documentation built on May 1, 2019, 9:11 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

pifpaf Potential Impact Fraction and Population Attributable Fraction for Cross-Sectional Data

Introduction to the pifpaf package: Estimating Potential Impact and Population Attributable Fractions from Cross-Sectional data In pifpaf: Potential Impact Fraction and Population Attributable Fraction for Cross-Sectional Data