R/modStats.R In openair: Tools for the Analysis of Air Pollution Data

Documented in modStats

##' Calculate common model evaluation statistics
##'
##' Function to calculate common numerical model evaluation statistics with
##' flexible conditioning
##'
##' This function is under development and currently provides some common model
##' evaluation statistics. These include (to be mathematically defined later):
##'
##' \itemize{
##'
##' \item \eqn{n}, the number of complete pairs of data.
##'
##' \item \eqn{FAC2}, fraction of predictions within a factor of two.
##'
##' \item \eqn{MB}, the mean bias.
##'
##' \item \eqn{MGE}, the mean gross error.
##'
##' \item \eqn{NMB}, the normalised mean bias.
##'
##' \item \eqn{NMGE}, the normalised mean gross error.
##'
##' \item \eqn{RMSE}, the root mean squared error.
##'
##' \item \eqn{r}, the Pearson correlation coefficient. Note, can also
##' supply and aurument \code{method} e.g. \code{method = "spearman"}
##'
##' \item \eqn{COE}, the \emph{Coefficient of Efficiency} based on
##' Legates and McCabe (1999, 2012). There have been many suggestions
##' for measuring model performance over the years, but the COE is a
##' simple formulation which is easy to interpret.
##'
##' A perfect model has a COE = 1. As noted by Legates and McCabe
##' although the COE has no lower bound, a value of COE = 0.0 has a
##' fundamental meaning. It implies that the model is no more able to
##' predict the observed values than does the observed
##' mean. Therefore, since the model can explain no more of the
##' variation in the observed values than can the observed mean, such
##' a model can have no predictive advantage.
##'
##' For negative values of COE, the model is less effective than the
##' observed mean in predicting the variation in the
##' observations.
##' \item \eqn{IOA}, the Index of Agreement based on Willmott et
##' al. (2011), which spans between -1 and +1 with values approaching
##' +1 representing better model performance.
##'
##' An IOA of 0.5, for example, indicates that the sum of the
##' error-magnitudes is one half of the sum of the observed-deviation
##' magnitudes.  When IOA = 0.0, it signifies that the sum of the
##' magnitudes of the errors and the sum of the observed-deviation
##' magnitudes are equivalent. When IOA = -0.5, it indicates that the
##' sum of the error-magnitudes is twice the sum of the perfect
##' model-deviation and observed-deviation magnitudes. Values of IOA
##' near -1.0 can mean that the model-estimated deviations about O are
##' poor estimates of the observed deviations; but, they also can mean
##' that there simply is little observed variability - so some caution
##' is needed when the IOA approaches -1.
##' }
##'
##' All statistics are based on complete pairs of \code{mod} and \code{obs}.
##'
##' Conditioning is possible through setting \code{type}, which can be
##' a vector e.g. \code{type = c("weekday", "season")}.
##'
##' Details of the formulas are given in the openair manual.
##'
##'
##' @param mydata A data frame.
##' @param mod Name of a variable in \code{mydata} that respresents modelled
##'   values.
##' @param obs Name of a variable in \code{mydata} that respresents measured
##'   values.
##' @param statistic The statistic to be calculated. See details below
##' for a description of each.
##' @param type \code{type} determines how the data are split
##' i.e. conditioned, and then plotted. The default is will produce
##' statistics using the entire data. \code{type} can be one of the
##' built-in types as detailed in \code{cutData} e.g. \dQuote{season},
##' \dQuote{year}, \dQuote{weekday} and so on. For example, \code{type
##' = "season"} will produce four sets of statistics --- one for each
##' season.
##'
##' It is also possible to choose \code{type} as another variable in
##' the data frame. If that variable is numeric, then the data will be
##' split into four quantiles (if possible) and labelled
##' accordingly. If type is an existing character or factor variable,
##' then those categories/levels will be used directly. This offers
##' great flexibility for understanding the variation of different
##' variables and how they depend on one another.
##'
##' More than one type can be considered e.g. \code{type = c("season",
##' "weekday")} will produce statistics split by season and day of the
##' week.
##' @param rank.name Simple model ranking can be carried out if
##' \code{rank.name} is supplied. \code{rank.name} will generally
##' refer to a column representing a model name, which is to
##' ranked. The ranking is based the COE performance, as that
##' indicator is arguably the best single model performance indicator
##' available.
##' @param ... Other aruments to be passed to \code{cutData} e.g.
##'   \code{hemisphere = "southern"}
##' @export
##' @return Returns a data frame with model evaluation statistics.
##' @author David Carslaw
##' @references
##' Legates DR, McCabe GJ. (1999). Evaluating the use of goodness-of-fit
##' measures in hydrologic and hydroclimatic model validation. Water
##' Resources Research 35(1): 233-241.
##'
##' Legates DR, McCabe GJ. (2012). A refined index of model
##' performance: a rejoinder, International Journal of Climatology.
##'
##' Willmott, C.J., Robeson, S.M., Matsuura, K., 2011. A
##' refined index of model performance. International Journal of
##' Climatology.
##' @keywords methods
##' @examples
##'
##' ## the example below is somewhat artificial --- assuming the observed
##' ## values are given by NOx and the predicted values by NO2.
##'
##' modStats(mydata, mod = "no2", obs = "nox")
##'
##' ## evaluation stats by season
##'
##' modStats(mydata, mod = "no2", obs = "nox", type = "season")
##'
##'
modStats <- function(mydata, mod = "mod", obs = "obs",
statistic = c(
"n", "FAC2", "MB", "MGE", "NMB",
"NMGE", "RMSE", "r", "COE", "IOA"
),
type = "default", rank.name = NULL, ...) {
## function to calculate model evaluation statistics
## the default is to use the entire data set.
## Requires a field "date" and optional conditioning variables representing measured and modelled values

## extract variables of interest
vars <- c(mod, obs)

if (any(type %in% dateTypes)) vars <- c("date", vars)

theStats <- c(
"n", "FAC2", "MB", "MGE", "NMB", "NMGE", "RMSE",
"r", "COE", "IOA"
)

matching <- statistic %in% theStats

if (any(!matching)) {
## not all variables are present
stop(cat("Can't find the statistic(s)", statistic[!matching], "\n"))
}

## check the data
mydata <- checkPrep(
mydata, vars, type,
remove.calm = FALSE,
strip.white = FALSE
)

mydata <- cutData(mydata, type, ...)

## calculate the various statistics

if ("n" %in% statistic) {
res.n <- group_by(mydata, UQS(syms(type))) %>%
do(n(., mod, obs))
} else {
res.n <- NULL
}

if ("FAC2" %in% statistic) {
res.FAC <- group_by(mydata, UQS(syms(type))) %>%
do(FAC2(., mod, obs))
} else {
res.FAC <- NULL
}

if ("MB" %in% statistic) {
res.MB <- group_by(mydata, UQS(syms(type))) %>%
do(MB(., mod, obs))
} else {
res.MB <- NULL
}

if ("MGE" %in% statistic) {
res.MGE <- group_by(mydata, UQS(syms(type))) %>%
do(MGE(., mod, obs))
} else {
res.MGE <- NULL
}

if ("NMB" %in% statistic) {
res.NMB <- group_by(mydata, UQS(syms(type))) %>%
do(NMB(., mod, obs))
} else {
res.NMB <- NULL
}

if ("NMGE" %in% statistic) {
res.NMGE <- group_by(mydata, UQS(syms(type))) %>%
do(NMGE(., mod, obs))
} else {
res.NMGE <- NULL
}

if ("RMSE" %in% statistic) {
res.RMSE <- group_by(mydata, UQS(syms(type))) %>%
do(RMSE(., mod, obs))
} else {
res.RMSE <- NULL
}

if ("r" %in% statistic) {
res.r <- group_by(mydata, UQS(syms(type))) %>%
do(r(., mod, obs))
} else {
res.r <- NULL
}

if ("COE" %in% statistic) {
res.COE <- group_by(mydata, UQS(syms(type))) %>%
do(COE(., mod, obs))
} else {
res.COE <- NULL
}

if ("IOA" %in% statistic) {
res.IOA <- group_by(mydata, UQS(syms(type))) %>%
do(IOA(., mod, obs))
} else {
res.IOA <- NULL
}

## merge them all into one data frame
results <- list(
res.n, res.FAC, res.MB, res.MGE, res.NMB, res.NMGE, res.RMSE, res.r,
res.COE, res.IOA
)

## remove NULLs from lits
results <- results[!sapply(results, is.null)]
results <- Reduce(function(x, y, by = type) merge(x, y, by = type, all = TRUE), results)

results <- sortDataFrame(results, key = type)

## simple ranking of models?
if (!is.null(rank.name)) {
types <- setdiff(type, rank.name)

if (length(types) == 0) {
results <- rankModels(results, rank.name)
} else {
results <- group_by(results, UQS(syms(types))) %>%
do(rankModels(., rank.name = rank.name))
}
}

results
}

sortDataFrame <- function(x, key, ...) {
## function to sort a data frame given one or more column names (key)
## from http://tolstoyc.newcastle.edu.au/R/help/04/07/1076.html

if (missing(key)) {
rn <- rownames(x)
if (all(rn %in% 1:nrow(x))) rn <- as.numeric(rn)
x[order(rn, ...), , drop = FALSE]
} else {
x[do.call("order", c(x[key], ...)), , drop = FALSE]
}
}

rankModels <- function(mydata, rank.name = "group") {

## sort by COE
mydata <- sortDataFrame(mydata, "COE", decreasing = TRUE)
}

## number of valid readings
n <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- nrow(x)
data.frame(n = res)
}

## fraction within a factor of two
FAC2 <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
ratio <- x[[mod]] / x[[obs]]
ratio <- na.omit(ratio)
len <- length(ratio)
if (len > 0) {
res <- length(which(ratio >= 0.5 & ratio <= 2)) / len
} else {
res <- NA
}
data.frame(FAC2 = res)
}

## mean bias
MB <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- mean(x[[mod]] - x[[obs]])
data.frame(MB = res)
}

## mean gross error
MGE <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- mean(abs(x[[mod]] - x[[obs]]))
data.frame(MGE = res)
}

## normalised mean bias
NMB <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- sum(x[[mod]] - x[[obs]]) / sum(x[[obs]])
data.frame(NMB = res)
}

## normalised mean gross error
NMGE <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- sum(abs(x[[mod]] - x[[obs]])) / sum(x[[obs]])
data.frame(NMGE = res)
}

## root mean square error
RMSE <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])
res <- mean((x[[mod]] - x[[obs]]) ^ 2) ^ 0.5
data.frame(RMSE = res)
}

## correlation coefficient
# when SD=0; will return(NA)
r <- function(x, mod = "mod", obs = "obs", ...) {
x <- na.omit(x[, c(mod, obs)])
res <- suppressWarnings(cor(x[[mod]], x[[obs]], ...))

data.frame(r = res)
}

##  Coefficient of Efficiency
COE <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])

res <- 1 - sum(abs(x[[mod]] - x[[obs]])) / sum(abs(x[[obs]] - mean(x[[obs]])))

data.frame(COE = res)
}

##  Index of Agreement
IOA <- function(x, mod = "mod", obs = "obs") {
x <- na.omit(x[, c(mod, obs)])

LHS <- sum(abs(x[[mod]] - x[[obs]]))
RHS <- 2 * sum(abs(x[[obs]] - mean(x[[obs]])))

if (LHS <= RHS) res <- 1 - LHS / RHS else res <- RHS / LHS - 1

data.frame(IOA = res)
}


Try the openair package in your browser

Any scripts or data that you put into this service are public.

openair documentation built on Nov. 22, 2021, 9:08 a.m.