fragilityindex is a package that implements and extends the fragility index calculation as described in Walsh et al. (2014).


``` {r, echo = FALSE} library(knitr) knitr::opts_chunk$set(collapse=TRUE,comment=NA)

# Introduction

## Theory

In randomized controlled trials, our belief in whether a treatment may have a real effect is often influenced by whether hypothesis testing demonstrates statistical significance. In the null hypothesis significance testing framewor, we reject the null hypothesis when the P-value is below a predetermined threshold. Rejecting the null hypothesis implies that the differences between treatment groups are unlikely to have occured due to chance.

Per Walsh et al. (2014), the use of threshold P-values may be an overly simple concept to determine the existence of a treatment effect. They are limited as a tool since readers may interpret P-values of similar magnitude similarly, regardless of sample size or number of trials. Also, P-values near the threshold value are interpreted very differently, depending on whether they are above or below it.

Walsh et al. (2014) suggest the use of an additional metric that demonstrates how easily a significant result can lose statistical signifance as a result of additional events being added to the results. This metric, the fragility index, indicates fragile results with smaller numbers, and non-fragile results with large ones.

The fragility index estimates how fragile statistical results are to small perturbations in event outcomes. This package implements an extension to the fragility index calculation for dichotomous results that can be used to estimate the fragility of statistical significance of predictors in linear, logistic, and Cox proportional hazards (survival analysis) regression models.

## Multivariate Fragility

### Concept
Potential covariates for multivariate regressions are often assessed for significant effects using hypothesis testing with the Wald t test. This allows the concept of the fragility index to be extended to models. However, since the majority of data does not have dichotomous outcomes, we cannot simply flip outcomes like the fragility index proposed by Walsh et al. (2014) calls for. Even for models with dichotomous outcomes, such as those used in survival analysis, specific event outcomes are associated with specific survival times and cannot be simply permuted.

Since a fragile test result is one which could have resulted in a p-value greater than alpha with slight dataset perturbation, the fragility index of a covariate in a multivariate regression can be defined as the minimum number of points one must remove before a model with that covariate is not significantly different from a model without that covariate.

### Comparing Function Outputs
The functions `logisticfragility`, `linearfragility`, and `survivalfragility` return approximations for fragility indices of multivariate regressions. These estimates should not be compared between different model types since each of the functions has a slightly different method of making these approximations. Generally, they search for points which favor the full model over the one missing a covariate and remove these points until the models are not distinguishable at the given confidence level. 

Covariates of similar fragility will yield the lowest fragility indices for logistic models and the highest for Cox P-H regressions in survival analysis. The algorithms for prioritizing points for removal from the dataset are closest to optimal for logistic regressions, while Cox P-H regressions have two response variables, weakening point selection and inflating fragility index values. We recommend comparing fragility index values within models, rather than between models. Oftentimes, differences in sample sizes also make comparisons of fragility between models is relatively meaningless. The most effective ways to use fragility index approximations are described in the sections for each function. The algorithms used to find them are also briefly described. The algorithm for `logisticfragility` is described in the "Multivariate Fragility Index Algorithm" section. It can be generalized to understand the other multivariate functions.


## *fragilityindex* Functions

Four of the functions in *fragilityindex* compute fragility indices for specific hypothesis tests:

* `fragility.index`
    + Implements fragility index per Walsh et al. (2014).
    + Estimates fragility of hypothesis testing using Fisher's exact test and Pearson's chi-squared test of homogeneity on dichotomous results.
* `logisticfragility`
    + Estimates fragility of significance of effects of selected predictor variables for logistic regression using Likelihood-ratio test.
* `linearfragility`
    + Estimates fragility of significance of effects of selected predictor variables for linear regression using Likelihood-ratio test.
* `survivalfragility`
    + Estimates fragility of significance of effects of selected predictor variables for Cox proportional hazards regression model using Pearson's chi-squared test of goodness of fit. 

`revfragility.index` estimates the fragility of nonsignificant results from a test of homogeneity on dichotomous results. It can be seen as the reverse of `fragility.index` since it reports the number of flipped outcomes to make a nonsignificant outcome significant.

---

# Functions
```r
library(fragilityindex)

fragility.index()

fragility.index(intervention_event, control_event, intervention_n, control_n, conf.level=0.95, verbose=FALSE, print.mat=FALSE)

We can use fragility.index to estimate the fragility of hypothesis testing using Pearson's chi-squared test or Fisher's exact test on dichotomous data from two groups.

The following table presents data for which the fragility index can provide insight. The data is made up of two groups, one which received the experimental treatment, while the other recieved the control treatment. The control treatment group has 5 recorded events with a sample size of 40 individuals and the experimental treatment group has 15 events with a sample size of 41 individuals.

\ |Control|Experimental ----------------|-------|------------ Events |5 | 15 Sample Size |40 | 41

Formatting the data for Chi-squared test:

control_events <- 5; control_sample <- 40
exp_events <- 15; exp_sample <- 41

mat <- matrix(c(control_events, exp_events, control_sample - control_events, exp_sample- exp_events), byrow = TRUE, 2, 2)

First we conduct a Pearson's Chi-squared test at an alpha of 0.05.

chisq.test(mat)

``` {r echo=FALSE} alpha <- 0.05 pval <- signif(chisq.test(mat)$p.value,digits = 3) if (pval < alpha) { bool <- "less" sig <- "significant" } else{ bool <- "greater" sig <- "not significant" }

This gives us a p-value of `r pval`, which is `r bool` than our alpha threshold of `r alpha`, indicating that the difference between the control and experimental groups is `r sig`. To check the fragility of this result, we can use the `fragility.index` function.

```r
fragility <- fragility.index(exp_events, control_events, exp_sample, control_sample, conf.level = 1 - alpha)

A fragility index of r fragility is concerning for a sample size of r control_sample. It means that if r fragility more members of the control group experienced an event, we would have found the groups to not be statistically significantly different. Especially in cases where subjects are lost due to failure to follow-up or other random events which prevent them from being recorded as a success or failure, this is very concerning. To see the changes done to the data, we can set print.mat = TRUE.

fragility.index(exp_events, control_events, exp_sample, control_sample, print.mat = TRUE)

The fragility index is stored under $index in a list. We can see the number of events increase by 1 each iteration while the number of non-events decreases by 1. The final matrix printed has nonsignificant differences between the two groups.

If we want to record the p-value associated with each iteration of flipping a non-event to event, we can set verbose = TRUE. This will cause the function to output p-values for each iteration, up to and including the iteration which makes the groups not significantly different.

fragility_object = fragility.index(exp_events, control_events, exp_sample, control_sample, verbose = TRUE)
fragility_object

The p-value next to the 0 entry for fragility.index is the p-value for Fisher's exact test on the original data. If the two groups are not statistically significantly different at the given confidence level with this test, fragility.index() will return 0 as the fragility index.

revfragility.index

revfragility.index(intervention_event, control_event, intervention_n, control_n, conf.level=0.95, verbose=FALSE, print.mat=FALSE)

When hypothesis testing of dichotomous data using Pearson's chi-squared test or Fisher's exact test fails to reject a null hypothesis, we can use revfragility.index to estimate the fragility of the test's conclusion. We use the term "reverse fragility" to refer to the fragility of a nonsignificant result from hypothesis testing.

The following table presents dichotomous data from two different groups. The control treatment group has 5 recorded events with a sample size of 40 individuals and the experimental treatment group has 15 events with a sample size of 41 individuals.

\ |Control|Experimental ----------------|-------|------------ Events |5 | 15 Sample Size |40 | 41

Formatting the data for Chi-squared test:

control_events <- 5; control_sample <- 40
exp_events <- 15; exp_sample <- 41

mat <- matrix(c(control_events, exp_events, control_sample - control_events, exp_sample- exp_events), byrow = TRUE, 2, 2)

First we conduct a Pearson's Chi-squared test at an alpha of 0.01.

chisq.test(mat)

``` {r echo=FALSE} alpha <- 0.01 pval <- signif(chisq.test(mat)$p.value,digits = 3) if (pval < alpha) { bool <- "less" sig <- "significant" } else{ bool <- "greater" sig <- "not significant"

}

This gives us a p-value of `r pval`, which is `r bool` than our alpha threshold of `r alpha`, indicating that the difference between the control and experimental groups is `r sig`. To check the reverse fragility of this result, we can use the `revfragility.index` function.

``` {r results="hide",message=FALSE,warning=FALSE}
library(fragilityindex)
revfragility <- revfragility.index(exp_events, control_events, exp_sample, control_sample, conf.level = 1 - alpha)
revfragility

This gives a reverse fragility index of r revfragility. Even though this means that just one flipped outcome would have yielded different results, this is not a particularly concerning reverse fragility index. The only conclusion that can be made from such high reverse fragility is that further testing with a much larger sample size could be considered. If looking to repeat data collection, sample size should be increased to avoid Type I error. Especially in cases where subjects are lost, preventing data collection of their outcomes, replication could be considered.

Similarly to the fragility.index function, we can set print.mat = TRUE to see the changes done to the table of data to simulate a significant result.

The reverse fragility index is stored under $index in a list.

If we want to record the p-value associated with each iteration of flipping an event to a non-event, we can set verbose = TRUE. This will cause the function to output p-values for each iteration, up to and including the iteration which makes the groups significantly different.

fragility_object = revfragility.index(exp_events, control_events, exp_sample, control_sample, verbose = TRUE, conf.level = 1 - alpha)
fragility_object

The p-value next to the 0 entry for fragility.index is the p-value for Fisher's exact test on the original data. If the two groups are statistically significantly different at the given confidence level with this test, revfragility.index() will return 0 as the reverse fragility index.

revfragility.index(exp_events, control_events, exp_sample, control_sample, verbose = TRUE, conf.level = 0.95)

Multivariate Fragility Functions

logisticfragility(formula, data, covariate = "all.factors.default", conf.level = 0.95, verbose = FALSE)
survivalfragility(formula, data, covariate = "all.factors.default", conf.level = 0.95, verbose = FALSE)
linearfragility(formula, data, covariate = "all.factors.default", conf.level = 0.95, verbose = FALSE)

We will be using logisticfragility on the heart_disease dataset for this example, but the parameters are identical to those of survivalfragility and linearfragility. One should not compare fragility index outputs between the three multivariate fragility functions (see "Comparing Function Ouputs" section for explanation).


The num variable reflects heart disease status and is originally in a severity scale from 1-4. For this analysis, we will transform the outcome into a binary 0/1 outcome since logisticfragility() passes the formula input to glm(family="binomial").

library(car)
mydata = heart_disease
mydata$num <- recode(mydata$num, "1:4=1")

First we perform an initial logistic regression. (For survivalfragility this must be a formula which can be passed to coxph() and for linearfragility this must be a formula which can be passed to lm(). These formulas should not have interaction terms.)

formula <- num ~ age + sex+ cp + chol + fbs +thalach + ca
glmfit <- glm(formula, family = "binomial", data = mydata)
summary(glmfit)

We can see that age and fbs are not significant at an alpha of 0.05 so the fragility indices of those covariates will be 0, meaning 0 points must be removed to make them nonsignificant.

logisticfragility(formula, data = mydata)

Covariates with very low p-values, such as sex and cp typically will be less fragile than those with p-values close to the threshold, such as chol. It only takes removing two points to make chol nonsignificant while it takes removing 24 points to do the same for sex.

If we only want to know the fragility index of a subset of covariates, we can enter their names, as they are written in the formula, into the covariate parameter. This will make the function run faster since it will not do unneeded calculations.

logisticfragility(formula, data = mydata, covariate = c("chol", "thalach") )

Since the function works by removing points which are most important in making a covariate significant, it can be valuable to examine the points. We can set verbose = TRUE to receive an output including removed points, and the resulting p-values from those removals.

logisticfragility(formula, data = mydata, covariate = c("fbs", "thalach"), verbose = TRUE)

Note that fbs has a fragility index of 0, so the verbose results state that no points were removed. The original p-value for the Likelihood Ratio Test is also returned.

We may wish to do further analysis of the removed points if we notice any trends, such as all of the removed points coming from a single level of a factor. This is especially the case when the covariate we are examining is nominal since the algorithm oftentimes removes all of the points from a single category, until that category no longer exists in the dataset. When this is the case, we would not interpret the conclusion to be particularly fragile unless the probability of including someone from that category is low. When that is the case, the covariate's significance is fragile since random chance can exclude those relevant subjects from the sample.

It would be fair to say that the conclusion of significance is fragile for chol, while significance is not fragile for the other covariates examined. If a result is fragile, it may call for replicating the collection of data with a larger sample size.

If we wish to examine fragility at a different alpha, we can set the conf.level parameter to 1 - alpha. The default is conf.level = 0.95 since an alpha of 0.05 is most common.

Warning: especially for extremely low p-values or large datasets with many covariates, running these functions can take over a minute. For some low p-values, the survivalfragility function does not converge. In these cases, warnings are returned.

Multivariate Fragility Index Algorithm

This section provides an explaination for the method for calculating the fragility index output for the logisticfragility function. this method can be generalized for the linear and survival fragility functions.

Simulating bivariate logistic data:

n <- 100; b0 <- -5; b1 <- .055; b2 <- .2
x <- runif(n, min = 1, max = 100)
x2 <- runif(n, min = 1, max = 20)
pi <- (exp(b0 + b1 * x + b2 * x2)) / (1 + exp(b0 + b1 * x + b2 * x2))
y <- rbinom(n, 1, pi)
mydata <- data.frame(x, x2, y)
names(mydata) <- c("Var", "Covar", "Status")
head(mydata)


Status is the response variable and Var and Covar are covariates for our logistic model. Note that Status has dichotomous outcomes.

model <-  glm(Status ~ Var + Covar, mydata, family = "binomial")
summary(model)

Both covariates are found to be significant at a 95% confidence level. The goal of the logisticfragility function is to determine the number of points on must remove to make the covariates nonsignificant at our chosen confidence level. The logistic model generated by glm for our data shows a significant relationship between Var and Response.

``` {r, echo = FALSE} plot(Status ~ Var, data = mydata, xlab = "Var", ylab = "Response", main = "Logistic Model") xsim <- seq(from = 1, to = 100, by = 1) xsim <- sort(rep(xsim,times = 10)) x2sim <- seq(from = 1, to = 20, length.out = 10) x2sim <- rep(x2sim, times = 100) ysim = predict(model, data.frame(Var = xsim, Covar = x2sim), type = "response") lines(predict(model, data.frame(Var = xsim, Covar = rep(mean(x2sim),length(x2sim))), type = "response")~xsim)

A model with a nonsignificant p-value for a covariate is one not found to be significantly different from a model without that covariate. We shall call that model `nullmodel`.
``` {r}
nullmodel <- glm(Status ~ Covar, mydata, family = "binomial")
summary(nullmodel)


Using the Likelihood Ratio Test, we can see that the two models are significantly different.

anova(model,nullmodel,test = "LRT")


Rather than removing points randomly, it is possible to isolate points which favor the complete model with the covariate over the model without it. There are four cases we can sort all of the points into:

  1. Points with Response = 1 at values of Var where model predicts a higher probability than nullmodel.
  2. Points with Response = 1 at values of Var where model predicts a lower probability than nullmodel.
  3. Points with Response = 0 at values of Var where model predicts a higher probability than nullmodel.
  4. Points with Response = 0 at values of Var where model predicts a lower probability than nullmodel.

``` {r, echo = FALSE} plot(Status ~ Var, data = mydata, xlab = "Var", ylab = "Response", main = "Logistic Model",type = "n") ysim = predict(model, data.frame(Var = xsim, Covar = x2sim), type = "response") lines(predict(model, data.frame(Var = xsim, Covar = rep(mean(x2sim),length(x2sim))), type = "response")~xsim, col = "red", lwd = 3) lines(rep(mean(predict(nullmodel, data.frame(Covar = x2sim), type = "response")),length(xsim)) ~ xsim, col = "blue" , lwd = 3) legend("right",legend = c("Model", "Null Model"), col = c("red","blue"),lty=c(1,1),lwd = c(3,3)) legend("left",title = "Case", legend = c("1", "2", "3", "4"), col = c("green","purple","orange","gray"),pch = 20, cex = 1.5) case12 = mydata[which(mydata$Status==1),] case34 = mydata[which(mydata$Status==0),] case12$modelpred = predict(model,case12, type = "response") case12$nullpred = predict(nullmodel, case12, type="response") case34$modelpred = predict(model,case34, type = "response") case34$nullpred = predict(nullmodel, case34, type="response") case1 = case12[which(case12$modelpred > case12$nullpred),] case2 = case12[which(case12$modelpred < case12$nullpred),] case3 = case34[which(case34$modelpred > case34$nullpred),] case4 = case34[which(case34$modelpred < case34$nullpred),] points(case2$Status~case2$Var, col = "purple", xlim = c(0,100), ylim = c(0,1), pch = 20, cex = 1.5) points(case3$Status~case3$Var, col = "orange", xlim = c(0,100), ylim = c(0,1), pch = 20, cex = 1.5) points(case1$Status~case1$Var, col = "green", xlim = c(0,100), ylim = c(0,1), pch = 20, cex = 1.5) points(case4$Status~case4$Var, col = "gray", xlim = c(0,100), ylim = c(0,1), pch = 20, cex = 1.5)

The two lines shown are simplified versions of `model` and `nullmodel`, predicting `Response` using the mean of `Covar` in addition to `Var`. The points are categorized into cases using the unsimplified model so points with values of `Var` near the boundary of `model` = `nullmodel' cannot be predicted simply by observing the plot.  

The removal of points in Cases 1 and 4 results in the convergence of `model` and `nullmodel`. The full model has higher predictive power for these points than the model without `Var`. These points can be found by taking the difference in residuals from the two models. 
Residuals in Cases 1 and 2 are positive while those in Cases 3 and 4 are negative. The residuals of `model` minus the residuals of `nullmodel` are positive in Cases 2 and 4 while those in Cases 1 and 3 are negative. By selecting points where the difference in residuals is negative with `Response` = 1 and where the difference in residuals is positive with `Response` = 0, we can isolate points which lead to a nonsignificant difference in `model` and `nullmodel` by the Likelihood Ratio Test. When selected in order by magnitude of difference in residuals, these two models converge quickly. 

Using this method, the `logisticfragility` function returns that removing `r logisticfragility(Status~Var+Covar,mydata)$Var$fragility.index` points can lead to a model in which `Var` is not statistically significant.

It must be remembered that the removal of points will inevitable lead to a model being found to have nonsignificant covariates, regardless of the points that are selected. This algorithm attempts to find the fastest path to this result.

```r
logisticfragility(Status ~ Var + Covar, mydata)

The points removed with this method, and the resulting p-value after removing each point are:

verboseresults <- logisticfragility(Status ~ Var + Covar, covariate = "Var", mydata, verbose = TRUE)
verboseresults

The points selected for removal are identified in red on this scatterplot.

``` {r, echo = FALSE} points = verboseresults$Var$point.diagnostics$Var removed = verboseresults$Var$point.diagnostics[,1:3] plot(Status ~ Var, data = mydata, xlab = "Var", ylab = "Response", main = "Logistic Model") points(removed$Status~removed$Var, col = "red", xlim = c(0,100), ylim = c(0,1), pch = 19) legend("right", legend = "Removed Points", col = "red", pch = 19)

The new model fit to the perturbed data is `nonsigmodel` while the new null model is `nonsignullmodel`. `Var` is no longer statistically significant while `Covar` is, since points which demonstrate `Var`'s explanatory effect on variation within `Response` have been removed.

``` {r}
newdata = mydata[ ! mydata$Var %in% points, ]
nonsigmodel <-  glm(Status ~ Var + Covar, newdata, family = "binomial")
nonsignullmodel <- glm(Status ~ Covar, newdata, family = "binomial")
summary(nonsigmodel)

It can be seen that these models have converged.

``` {r, echo = FALSE} plot(Status ~ Var, data = newdata, xlab = "Var", ylab = "Response", main = "Logistic Model") ordereddata = newdata[order(newdata$Var),] lines(predict(nonsigmodel, data.frame(Var = ordereddata$Var, Covar = rep(mean(ordereddata$Covar),nrow(ordereddata))), type = "response") ~ ordereddata$Var, col = "red", lwd = 3) lines(rep(mean(predict(nonsignullmodel, data.frame(Covar = ordereddata$Covar), type = "response")),nrow(ordereddata)) ~ ordereddata$Var, col = "blue",lwd = 3) legend("bottom",legend = c("Model", "Null Model"), col = c("red","blue"),lty=c(1,1),lwd = c(3,3))

If we run an ANOVA, comparing a model with `Var` and a model without `Var`, we will find that `Var` is no longer statistically significant after the data perturbation.

``` {r}
anova(nonsigmodel,nonsignullmodel,test = "LRT")

References

Walsh, Michael, et al. "The Statistical Significance of Randomized Controlled Trial Results Is Frequently Fragile: a Case for a Fragility Index." Journal of Clinical Epidemiology, vol. 67, no. 6, 2014, pp. 622-628., doi: 10.1016/j.jclinepi.2013.10.019.



kippjohnson/fragilityindex documentation built on May 20, 2019, 10:23 a.m.