RFPM"
In RFPM: Floating Percentile Model

require(dplyr)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

RFPM provides an implementation of the Floating Percentile Model (FPM) in the R statistical environment. The FPM was originally developed by others on behalf of the Washington State Department of Ecology (Avocet 2003) and has since been recommended for use in Washington, Oregon, and Idaho (Ecology 2011). The FPM currently exists as a Microsoft Excel macro (written with Visual Basic for Applications). RFPM represents an independent effort to port the FPM to R with the purpose of:
- correcting a data handling error
- streamlining the chemical selection and sediment quality benchmark calculation procedures
- providing new tools to rapidly evaluate and refine benchmarks
- improving several aspects of the FPM algorithm
- improving the availability, accessibility, and transparency of the FPM tool

The purpose of this vignette is to lead the reader through an example analysis of real-world data using the core functions of RFPM rather than go into great detail on any particular function. For more function details, see the help files by using ? or help() (or by navigating in RStudio to the Packages tab, selecting RFPM from the list of available packages, and then selecting the function of interest).

Functionality

The main user interface within RFPM is via the FPM function. To use this function, the user must first prepare a sediment chemistry and toxicity data.frame that contains one or more chemical columns and a single toxicity column called Hit. Several example datasets are provided with RFPM with this structure, for example h.northport, which we will evaluate repeatedly in this vignette. Other datasets include the empirical c.northport and h.tristate as well as simulated FPM data perfect, lowNoise, and highNoise.

Other key functions that the user is directed to include optimFPM, cvFPM, and chemVI which can help the user to explore and refine FPM inputs and optimize FPM outputs.

Case study - Amphipod toxicity in Northport, WA sediments

In this vignette, we focus on evaluating a case study dataset called h.northport, which includes a data.frame of bulk sediment chemistry metals concentrations (mg/kg) and Hit data. Hits are logical results indicating whether a particular sample was deemed toxic (TRUE) or not (FALSE) based on the Hyalella azteca biomass endpoint in 28-day laboratory exposures. The definition of a toxic Hit is up to the practitioner and not a part of the model. Data must be input to FPM in a cross-tab (wide) format, meaning each sediment chemical has a unique column and each sample has just one row.

library(RFPM)

head(h.northport)

To run FPM, you will need to specify, at a minimum, the data.frame object and the column names associated with sediment chemistry data. As noted above, FPM also requires that the data.frame include a column called Hit. Columns other than Hit and those specified using the paramList argument are ignored.

p.northport <- names(h.northport)[1:10] ## all chemical column names
FPM(data = h.northport, paramList = p.northport) ## minimum input - dataset and chemical column names

The output of FPM is a list with at least 1 item called FPM, which displays three types of data:
- The sediment quality benchmark for each of the chemicals selected by FPM as significant (meaning that concentrations when Hit == TRUE are significantly greater than when Hit == FALSE).
- Selected FN_crit value, the user-defined limit on benchmark conservatism (default = 0.2)
- Classification metrics: e.g., false negatives rate (pFN), false positive rate (pFP), sensitivity (sens), specificity (spec), overall reliability (OR), Fowlkes-Mallows Index (FM), and Matthew's Correlation Coefficient (MCC). These provide an indication of the relative fit of the FPM benchmarks to the underlying toxicity dataset.

This is technically all that was required to run the FPM on this example dataset. There is of course a lot going on behind the scenes. Functions that were used within FPM include:
- chemSig: chemical selection algorithm, which evaluates distribution assumptions of normality and equal variance, then applies appropriate hypothesis test methods to identify significant differences between concentrations when Hit == TRUE and Hit == FALSE. The output of chemSig is a logical vector indicating which chemicals are significant (therefore selected by FPM for generating benchmarks).
- chemSigSelect: a function that uses chemSig but instead returns the chemistry data for significant chemicals and has options for plotting chemistry data distributions to compare Hit == TRUE and Hit == FALSE subsets. Setting plot = TRUE in FPM turns on plot outputs (by passing that argument to chemSigSelect).

As an example, see the following figures generated by plot.chemSigSelect (identical to those generated by FPM when plot = TRUE) for a significant chemical (Cr), shown with blue points, and a non-significant chemical (Al), shown with green points:

plot(chemSigSelect(h.northport[, c("Al", "Cr", "Hit")], paramList = c("Al", "Cr")), type = "boxplot")

There are also many arguments that can be adjusted in FPM that affect how chemSig and chemSigSelect function or how the FPM algorithm itself functions. Users can adjust the test methods/assumptions for selecting chemicals, force the FPM to accept the assumptions of the original Excel-based function (i.e., ExcelMode == TRUE), etc. As noted above, the output of FPM is a list; additional items can be appended to the list if the user sets the densInfo, lockInfo, or hitInfo arguments equal to TRUE. Respectively, these provide information about 1) how much each of the benchmarks varied within the FPM algorithm, 2) what caused each benchmark to lock into place and in what order, and 3) what the predicted Hit values would be based on the calculated benchmarks. We have been interested in each of these components at various times of model development, but expect the typical user to not need this information.

See ?FPM and ?chemSig for more information.

Optimizing FPM Inputs

In RFPM, there are currently two optimization functions available:
- optimFPM is intended to find an optimal input (to FPM) of the FN_crit and/or alpha.test parameter. These optimal levels are based on the full dataset (input as data argument) and, therefore, may be over-specific to that data. The FN_crit and alpha.test arguments should be input either as ranges or single values depending on how you want the optimization to run. Inputting a single value for FN_crit and a range for alpha.test would optimize the alpha.test value only and vice-versa. Inputting ranges for both values results in slightly more complex output owing to the 2-dimensional optimization.
- cvFPM is similarly parameterized to optimize FN_crit and/or alpha.test inputs using a cross-validation (CV) approach. The results are "best" values that account for uncertainty in future data. The user can adjust the k parameter to change the number of folds in the CV algorithm (up to k = n, where n is the number of the samples). This affects the smoothness of the mean/median optimization curves by generating more hypothetical datapoints; it also is likely to expand the min/max range of optimization values by providing more chances for extreme outcomes. By using larger training datasets (i.e., with larger k inputs), the CV-based optimum will converge toward the empirical optimization curve generated by optimFPM (and shown in the output of cvFPM). While we generally recommend using cvFPM over optimFPM when possible, there are cases when cvFPM cannot be used. These include the analysis of small datasets and/or when k is set to a small number; in either of these cases, FPM may fail because of a lack of significant chemicals being identified chemSig (dependent on N).

This is an example of the empirical optimization approach with optimFPM:

## one-way optimization of the FN Limit - vertical lines show best values based on two metrics
optimFPM(h.northport, p.northport, FN_crit = seq(0.1, 0.9, 0.05), alpha.test = 0.05)

## two-way optimization of both FN Limit and alpha - black squares show best values based on two metrics
optimFPM(h.northport, p.northport, 
         FN_crit = seq(0.1, 0.9, 0.05), 
         alpha.test = seq(0.01, 0.2, 0.01))

The resulting plots show a mix of potential outcomes. When only one parameter is being optimized (i.e., length(alpha.test) == 1 | length(FN_crit) == 1), then the first plot is produced, showing fit statistics over the evaluated range. Vertical lines identify optimal values in OR, sensitivity/specificity ratio (equal to min(c(sens, spec))/max(c(sens, spec)), FM, or MCC.

In the case that two parameters are being optimized together (i.e., length(alpha.test) > 1 & length(FN_crit) > 1), then a matrix of results is provided representing with heat colors (by default) to indicate the optimization outcome (yellow = suboptimal, red = optimal). Squares indicate the selected values, which are also output into the console.

This is an example of the CV-based optimization approach with cvFPM and 10-folds:

cvFPM(h.northport, p.northport, k = 10, FN_crit = seq(0.1, 0.9, 0.05), alpha.test = 0.05, which = 2)

Optimization statistics from multiple CV runs are shown (as well as the empirical "all data" result from optimFPM) as lines. When run with multiple alpha.test and FN_crit inputs, cvFPM will generate a heat map simliar to optimFPM with values representing on the mean result from CV runs.

Based on the output from either optimFPM or cvFPM, the practitioner may decide to adjust and rerun the FPM using optimized inputs. The decision of whether this is appropriate will depend on the specific management scenario. For example, the optimal FN Limit could end up being 80%, but that lack of conservatism may be unacceptable to a regulator. Similarly, the optimal alpha.test could be 0.5, meaning that the regulator would need to accept a 50% probability of error when selecting significant chemicals; this is a very high uncertainty, 10-fold higher than typical. Regardless, optimization provides useful context to the RFPM user.

Variable Importance

The chemVI function outputs several potentially useful metrics for evaluating individual metals in the sediment quality benchmark set. These metrics are:
- chemDensity: how little the benchmark "floated" within the FPM algorithm. High density values correspond to benchmarks that remained at a relatively low concentrations (i.e., did not float up), suggesting relative importance
- MADP: average change in other chemical's benchmarks resulting from removal of a chemical from data
- dOR: change in the overall reliability resulting from removal of the chemical from data
- dFM: change in the Folkes-Mallows Index resulting from removal of the chemical from data
- dMCC: change in the Matthew's COrrelation Coefficient resulting from removal of the chemical from data

The MADP and dOR metrics are the more useful for identifying important chemicals; for example:

chemVI(h.northport, p.northport)

Here we see that copper (Cu) has an MADP, dOR, dFM and dMCC equal to 0, meaning that there was no change in the benchmarks or their ability to predict toxicity after removing copper. Clearly copper is, therefore, not important in this case and could be excluded from further consideration. However, that isn't to say that copper isn't related to toxicity; it's very important to keep in mind that the FPM is a classification tool that attempts to accurately predict Hit values. The significant chemicals in this list are mostly well correlated, so toxicity could be related to one, all, or perhaps none of the chemicals (i.e., an unmeasured but related chemical or non-chemical stressor). The inclusion of new chemicals in (or exclusion of important chemicals from) data can result in markedly different benchmarks and chemVI metrics (particularly for the chemDensity metric). Therefore, we recommend pre-screening data to remove chemicals that are unlikely to cause toxicity (e.g., due to poor bioavailability, low toxicity, or uniformly low concentrations). These chemicals, while potentially "significant" with respect to having higher concentrations when Hit == TRUE, may lead to problems down the road when predicting toxicity more generally. In other words, try to the extent possible to establish a weight of evidence pointing toward causation rather than letting pure correlation lead your analysis. In the current example, we perhaps should have pre-screened chemicals like iron (Fe) and aluminum (Al), which we expected to have low toxicity at natural levels. Aluminum was ultimately screened out by FPM as non-significant, but iron was significant and retained.
chemVI can be helpful with respect to developing a parsimonious benchark set by identifying benchmarks that do not affect prediction at the site. In our h.northport example, Cu and Fe will be removed because they don't affect toxicity predictions (either positively or negatively). Whenever any chemicals would be removed from consideration through an iterative FPM benchmark development process, FPM must be rerun to generate new benchmarks. We caution the user to drop only one chemical at a time, rerunning FPM each time. It is possible that dropping one chemical can result in unexpected results, such as other chemicals becoming more important.

Compare the two outputs:

FPM(data = h.northport, paramList = p.northport)
FPM(data = h.northport, paramList = c("Cr", "Zn"))

For the FPM benchmarks generated for h.northport, the MADP for Fe was >0, and the benchmark for Zn shifted after removing Fe from consideration. Although the benchmark changed, the classification errors did not, as reflected by the unchanged OR. The final FPM benchmark set including only Cr and Zn is, therefore, the most parsimonious, by which we mean that it classifies Hit values just as accurately as the larger set of benchmarks but does so with fewer benchmarks. We recommend using a parsimonious set of benchmarks on a site-specific basis, whenever possible, because this should avoid problems of "overfitting" that might cause confounding variables to trigger erroneous conclusions.

You may have noticed that we didn't use optimized inputs in the example above for chemVI. We omitted that to provide an example where relatively unimportant chemicals could be screened out using the metrics generated by chemVI. If we instead change the inputs of chemVI to the optimized FN_crit and alpha.test (using the OR optimization approach), the chemVI results look like this instead:

FPM(h.northport, p.northport, FN_crit = 0.25, alpha.test = 0.15)$FPM
chemVI(h.northport, p.northport, FN_crit = 0.25, alpha.test = 0.15)

In this case, adjusting alpha.test has allowed for 2 new chemicals to be added, aluminum and lead (Pb), which were not selected when alpha.test = 0.05. After including these chemicals (and slightly adjusting the FN_crit), the final OR increased to 67%, a 10% improvement in accuracy over the default values (not shown). MCC increased by a similar margin. Based on the results of chemVI shown above, there is a marked shift in the relative importance of metals for toxicity predictions. Removing any of the chemicals would result in a roughly 10% reduction in accuracy (based on the dOR metric), and removing Cu now would have an effect on the other benchmarks values (as shown by the MADP metric >0). The chemDensity value for Cr has dropped to <1%, meaning that the benchmark is now close to the maximum possible value, whereas it had one of the highest chemDensity values before.

As you can see, variable importance is dependent on the set of benchmarks, not individual benchmarks in isolation. The addition of new chemicals impacted not only on the relative importance of each chemical but the overall reliability of the set of chemicals and on the FPM benchmark values themselves.

Notes on Benchmark Applications and Interpretation

We want to emphasize several findings from our analyses of the characteristics, behavior, and sensitivity of the FPM (the subject of two anticipated manuscripts).

FPM benchmarks should be treated as a set of values, not values to be used singularly or mixed and matched between runs of the FPM using different species, toxicity test endpoints, sites, etc. There may be a desire or an inclination to select the most sensitive species and test endpoints among chemicals. Doing so, however, would invalidate the various assumptions made when generating FPM benchmarks, including the projected FN Limit (i.e., intended FN_crit and actual pFN) and the accuracy of benchmarks (i.e., OR, FM, and MCC). If there are multiple species that need to be accounted for by the model, consider different approaches to defining Hit values that capture multiple species and/or endpoints prior to running FPM. For example, assign Hit == TRUE if any of several endpoints indicates toxicity in a given sample. This approach will result in a single benchmark set for all endpoints rather than generating multiple sets of benchmarks (as was done by Ecology 2011). The resulting benchmarks may be conservative, but they would not deviate from the intended statistical underpinnings of FPM outputs in the same way that mixing-and-matching multiple benchmark sets would.
FPM benchmarks should be considered as predictive of toxicity classifications (i.e., Hit values) rather than continuous toxicity test results (e.g., 10% mortality, 27% biomass, etc.). For this reason, we would generally argue against the application of FPM benchmarks within a Natural Resource Damage Assessment (NRDA) context. We recognize that future approaches might be proposed for applying FPM benchmarks in a NRDA context; while theoretically possible, such attempts should be considered carefully, particularly with regard to how Hits are defined (and whether the definition reasonably corresponds with definitions of injury).
FPM benchmarks are most powerful when they are site-specific. By using site toxicity and sediment chemistry data, and by implicitly considering factors like bioavailability, FPM benchmarks should improve upon existing default benchmarks (Ecology 2011). Therefore, we recommend using RFPM and site-specific data to the extent practicable to derive new benchmarks.
While it is most common to generate benchmarks based on bulk sediment concentrations, it is not necessary that this be the case. For example, benchmarks could reflect organic carbon-normalized sediment concentrations, pore water concentrations, etc. These measurements might improve the relevance of toxicity predictions by taking bioavailability more explicitly into account. Attempts to generate benchmarks based on these types of data simply complicate the application of the benchmarks by requiring all future samples be analyzed and concentrations reported in the same way. Moreover, some approaches will not be possible such as the simultaneously extracted metal - acid volatile sulfide method, which can generate non-positive data (that cannot be handled by FPM).
While there may be a concern about mixing chemicals of different types and mechanisms/modes of action (e.g., metals and polycyclic aromatic hydrocarbons) when generating FPM benchmarks, it does not seem to us that such a distinction should matter. From a conceptual standpoint, the FPM can be used to predict any variable that is distilled to a TRUE/FALSE classification from a set of one or more independent variables (chemical or otherwise). The key limitations that arise when mixing chemicals (and/or non-chemical stressors) are:
All analytes added to the FPM via data need to have a common directionality with respect to Hits. In other words, toxicity is expected to increase as, for example, copper increases. Therefore, you could not also include pore water acidity, which increases toxicity at both high and low pH. This limitation is related to the chemical selection process, which (by default) assumes that concentrations will be higher among toxic samples. It also relates to the iteratively increasing nature of the FPM algorithm; concentrations start at a low level and then shift in only one direction (up/increasing).
The inclusion of many correlated variables may increase the number of irrelevant benchmarks. For example, in the analysis of h.northport dataset presented above (using default FPM inputs), iron was correlated with other metals (not shown) and, as a result, was selected for inclusion among the benchmarks despite being relatively non-toxic and contributing not at all to the predictive accuracy of the benchmark set. As described above, chemVI can help to identify potential simplifications to the benchmark set when correlated variables lead to unnecessary complexity. The user is also free to pre-screen their data to remove redundant variables.
FPM has been found to work poorly or not at all with missing data. Consider removing analytes with incomplete datasets, omit samples with partial analysis, and/or impute values to fill data gaps.
FPM does not currently address non-detection in a meaningful way. As the model is largely based on percentiles and ranges rather than averages, this shouldn't generally lead to major problems as long as there isn't a high degree of non-detection (e.g., >20%). We leave it to the user to address how non-detects will be handled and to use FPM benchmarks based on non-detected data with caution. High degrees of non-detection of a chemical is a good reason for excluding it from data (or paramList), as there is not a reasonable way to use such a chemical to predict Hits. The use of more advanced censored data imputation methods (e.g., Kaplan-Meier or Regression on Order Statistics [ROS]) may provide useful, but should also be applied with care; using these methods (in addition to treating non-detects as half of detection limits or as zero) have the potential to result in benchmarks that fall below detection limits. We would argue that it is unreasonable to apply benchmarks that fall at or below detection limits.
As with other toxicity-based benchmarks, it may be important to consider background chemical concentrations and reference area toxicity levels when developing and applying FPM benchmarks. Reference area toxicity can be incorporated explicitly into Hits definitions by normalizing toxicity data to the reference condition prior to classifying Hits, and background can be used to qualify chemical exceedances of FPM benchmarks that fall below background levels. Examples of applying a reference condition would include dividing site toxicity by mean reference area toxicity (then applying a default effect threshold like 80%) or using a percentile of reference toxicity as the defining threshold for Hits (e.g., site toxicity >80th percentile of reference effect).

References cited:

Avocet. 2003. Development of freshwater sediment quality values for use in Washington State. Phase II report: Development and recommendation of SQVs for freshwater sediments in Washington State. Publication No. 03-09-088. Prepared for Washington Department of Ecology. Avocet Consulting, Kenmore, WA.

Ecology. 2011. Development of benthic SQVs for freshwater sediments in Washington, Oregon, and Idaho. Publication no. 11-09-054. Toxics Cleanup Program, Washington State Department of Ecology, Olympia, WA.