R/chemVI.R
In RFPM: Floating Percentile Model

#' Chemical Variable Importance for Floating Percentile Model Benchmarks
#' 
#' Generate statistics describing the relative importance of chemicals among benchmarks generated by \code{FPM}
#' 
#' @param  data data.frame containing, at a minimum, chemical concentrations as columns and a logical \code{Hit} column classifying toxicity
#' @param  paramList character vector naming columns of \code{data} containing concentrations
#' @param  ... additional arguments passed to \code{chemSig}, \code{chemSigSelect}, and \code{FPM}
#' @details The purpose of \code{chemVI} is to inform the user about the relative influence of each chemical over the sediment quality benchmarks generated by \code{FPM}.
#' Five statistics are generated: \code{chemDensity}, \code{MADP}, \code{dOR}, \code{dFM}, and \code{dMCC}. The \code{chemDensity} statistic (which is also generated by \code{FPM})
#' describes how little a particular chemical's value increased within the floating percentile model algorithm.
#' Low \code{chemDensity} (close to 0) means that the value was able to increase substantially within the algorithm without triggering any of the criteria for
#' stopping the algorithm (see \code{?FPM}), whereas high \code{chemDensity} (close to 1) indicates that the final benchmark for that chemical did not float (increase)
#' much before being locked in. In other words, low \code{chemDensity} might be interpreted as relatively low importance. We caution against using this
#' metric in isolation, as it is the most difficult of the five to interpret.
#' The \code{MADP} statistic (mean absolute difference percent) is calculated by sequentially dropping each chemical from consideration, recalculating the benchmarks
#' for the remaining chemicals, and then determining how much each benchmark changed (as a percent of the original value). Thus, the \code{MADP}
#' is a measure of a chemical's influence over the other benchmarks. The \code{dOR} statistic is the percent difference in overall reliability
#' between benchmarks calculated with all chemicals and benchmarks calculated without each chemical. \code{dFM} and \code{dMCC} are analogous to \code{dOR}, but for the Fowlkes-Mallows Index
#' and Matthews Correlation Coefficient. For these three difference statistics, larger positive values indicate a greater impact of a chemical
#' on the overall predictive performance of floating percentile model benchmarks. Small values (close to 0) indicate low influence. Larger negative values indicate that
#' the chemical adversely impacts toxicity predictions. If there are chemicals with negative values, consider reevaluating the data without the associated chemical
#' or using \code{optimFPM} or \code{cvFPM} to optimize the overall reliability prior to running \code{FPM} and \code{chemVI}.
#' 
#' @seealso chemSig, chemSigSelect, optimFPM, cvFPM, FPM
#' @return matrix with one row per chemical and 5 columns (\code{chemDensity}, \code{MADP}, \code{dOR}, \code{dFM}, and \code{dMCC})
#' @examples
#' paramList = c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
#' chemVI(h.tristate, paramList, testType = "np")
#' chemVI(h.tristate, paramList, testType = "p")
#' @export

chemVI <- function(data, 
                   paramList, 
                   ...){
    fpm <- FPM(data, paramList, densInfo = TRUE, ...)
    pL <- names(fpm[["FPM"]])[1:(length(fpm[["FPM"]]) - 12)]
    fpm.SQB <- fpm[["FPM"]][pL]
    fpm.STAT <- fpm[["FPM"]][c("sens", "spec", "OR", "FM", "MCC")]
    
    tmp <- list()
    tmp.SQB <- list()
    tmp.STAT <- list()
    
    # Refit the model once per chemical, leaving that chemical out
    for (i in seq_along(pL)){
        pL.i <- pL[-i]
        tmp[[i]] <- FPM(data, paramList = pL.i, paramOverride = TRUE, ...)[["FPM"]]
        tmp.SQB[[i]] <- tmp[[i]][pL.i]
        tmp.STAT[[i]] <- tmp[[i]][c("sens", "spec", "OR", "FM", "MCC")]
    }
    
    tmp2 <- list(); tmp3 <- list(); tmp4 <- list(); tmp5 <- list()
    
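    for (i in seq_along(tmp)){
        # MADP: mean absolute percent change in the remaining benchmarks
        tmp2[[i]] <- 100 * mean(abs(as.numeric((tmp.SQB[[i]] - fpm.SQB[-i])/fpm.SQB[-i])))
        # dOR, dFM, dMCC: full model minus reduced model, as a percent of the full model,
        # so positive values mean the chemical improves predictions (as documented)
        tmp3[[i]] <- 100 * (fpm.STAT$OR - tmp.STAT[[i]]$OR)/fpm.STAT$OR
        tmp4[[i]] <- 100 * (fpm.STAT$FM - tmp.STAT[[i]]$FM)/fpm.STAT$FM
        tmp5[[i]] <- 100 * (fpm.STAT$MCC - tmp.STAT[[i]]$MCC)/fpm.STAT$MCC
    }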
    
    tmp2 <- data.frame(tmp2); names(tmp2) <- pL
    tmp3 <- data.frame(tmp3); names(tmp3) <- pL
    tmp4 <- data.frame(tmp4); names(tmp4) <- pL
    tmp5 <- data.frame(tmp5); names(tmp5) <- pL
    
    x <- do.call(rbind, list(round(100 * fpm[["chemDensity"]], 3), round(tmp2, 3), round(tmp3, 3), round(tmp4, 3), round(tmp5, 3)))
    row.names(x) <- c("chemDensity", "MADP", "dOR", "dFM", "dMCC")
    return(t(x))
}## end code
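
The leave-one-out arithmetic behind \code{MADP} and \code{dOR} can be illustrated without running \code{FPM}. The following is a minimal sketch in base R, not the package's implementation: the benchmark values, reliability values, and object names (`full`, `loo`, `or_full`, `or_loo`) are all made up for illustration, and the sign convention for `dOR` (full model minus reduced model) is assumed from the documentation's statement that positive values indicate a chemical that improves predictions.

```r
# Hypothetical benchmarks from a model fit with all chemicals (not real FPM output)
full <- c(Cd = 2, Cu = 50, Zn = 200)

# Hypothetical benchmarks recomputed after dropping each chemical in turn
loo <- list(
  Cd = c(Cu = 55, Zn = 180),  # Cd dropped
  Cu = c(Cd = 2,  Zn = 200),  # Cu dropped
  Zn = c(Cd = 1,  Cu = 50)    # Zn dropped
)

# MADP: mean absolute percent change in the benchmarks that remain
madp <- sapply(names(full), function(chem) {
  kept <- full[names(full) != chem]
  100 * mean(abs(loo[[chem]][names(kept)] - kept) / kept)
})

# Hypothetical overall reliability with all chemicals and without each one
or_full <- 0.8
or_loo  <- c(Cd = 0.7, Cu = 0.8, Zn = 0.85)

# dOR: percent difference relative to the full model (assumed sign: full - reduced)
dOR <- 100 * (or_full - or_loo) / or_full

print(round(madp, 3))
print(round(dOR, 3))
```

In this toy data, dropping Cu changes nothing (both statistics are 0), dropping Cd shifts the other benchmarks and lowers reliability (positive `dOR`), and dropping Zn raises reliability (negative `dOR`), which is the case where the documentation suggests reevaluating the data without that chemical.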