R/conditionalQuantile.R In openair: Tools for the Analysis of Air Pollution Data

Documented in conditionalQuantile

##' Conditional quantile estimates for model evaluation
##'
##' Function to calculate conditional quantiles with flexible conditioning. The
##' function is for use in model evaluation and more generally to help better
##' understand forecast predictions and how well they agree with observations.
##'
##' Conditional quantiles are a very useful way of considering model
##' performance against observations for continuous measurements (Wilks, 2005).
##' The conditional quantile plot splits the data into evenly spaced bins. For
##' each predicted value bin e.g. from 0 to 10~ppb the \emph{corresponding}
##' values of the observations are identified and the median, 25/75th and 10/90
##' percentile (quantile) calculated for that bin. The data are plotted to show
##' how these values vary across all bins. For a time series of observations
##' and predictions that agree precisely the median value of the predictions
##' will equal that for the observations for each bin.
##'
##' The conditional quantile plot differs from the quantile-quantile plot (Q-Q
##' plot) that is often used to compare observations and predictions. A
##' Q-Q~plot separately considers the distributions of observations and
##' predictions, whereas the conditional quantile uses the corresponding
##' observations for a particular interval in the predictions. Take as an
##' example two time series, the first a series of real observations and the
##' second a lagged time series of the same observations representing the
##' predictions. These two time series will have identical (or very nearly
##' identical) distributions (e.g. same median, minimum and maximum). A Q-Q
##' plot would show a straight line showing perfect agreement, whereas the
##' conditional quantile will not. This is because in any interval of the
##' predictions the corresponding observations now have different values.
##'
##' Plotting the data in this way shows how well predictions agree with
##' observations and can help reveal many useful characteristics of how well
##' model predictions agree with observations --- across the full distribution
##' of values. A single plot can therefore convey a considerable amount of
##' information concerning model performance. The \code{conditionalQuantile}
##' function in openair allows conditional quantiles to be considered in a
##' flexible way e.g. by considering how they vary by season.
##'
##' The function requires a data frame consisting of a column of
##' observations and a column of predictions. The observations are
##' split up into \code{bins} according to values of the
##' predictions. The median prediction line together with the 25/75th
##' and 10/90th quantile values are plotted together with a line
##' showing a \dQuote{perfect} model. Also shown is a histogram of
##' predicted values (shaded grey) and a histogram of observed values
##' (shown as a blue line).
##'
##' Far more insight can be gained into model performance through conditioning
##' using \code{type}. For example, \code{type = "season"} will plot
##' conditional quantiles by each season. \code{type} can also be a factor or
##' character field e.g. representing different models used.
##'
##' See Wilks (2005) for more details and the examples below.
##'
##' @param mydata A data frame containing the field \code{obs} and \code{mod}
##'   representing observed and modelled values.
##' @param obs The name of the observations in \code{mydata}.
##' @param mod The name of the predictions (modelled values) in \code{mydata}.
##' @param type \code{type} determines how the data are split
##' i.e. conditioned, and then plotted. The default is will produce a
##' single plot using the entire data. Type can be one of the built-in
##' types as detailed in \code{cutData} e.g. \dQuote{season},
##' \dQuote{year}, \dQuote{weekday} and so on. For example, \code{type
##' = "season"} will produce four plots --- one for each season.
##'
##' It is also possible to choose \code{type} as another variable in the data
##'   frame. If that variable is numeric, then the data will be split into four
##'   quantiles (if possible) and labelled accordingly. If type is an existing
##'   character or factor variable, then those categories/levels will be used
##'   directly. This offers great flexibility for understanding the variation
##'   of different variables and how they depend on one another.
##'
##' Type can be up length two e.g. \code{type = c("season", "weekday")} will
##'   produce a 2x2 plot split by season and day of the week. Note, when two
##'   types are provided the first forms the columns and the second the rows.
##' @param bins Number of bins to be used in calculating the different quantile
##'   levels.
##' @param min.bin The minimum number of points required for the estimates of
##'   the 25/75th and 10/90th percentiles.
##' @param xlab label for the x-axis, by default \dQuote{predicted value}.
##' @param ylab label for the y-axis, by default \dQuote{observed value}.
##' @param col Colours to be used for plotting the uncertainty bands and median
##'   line. Must be of length 5 or more.
##' @param key.columns Number of columns to be used in the key.
##' @param key.position Location of the key e.g. \dQuote{top},
##' \dQuote{bottom}, \dQuote{right}, \dQuote{left}. See \code{lattice}
##' \code{xyplot} for more details.
##' @param auto.text Either \code{TRUE} (default) or \code{FALSE}. If
##'   \code{TRUE} titles and axis labels etc. will automatically try and format
##'   pollutant names and units properly e.g.  by subscripting the 2' in NO2.
##' @param \dots Other graphical parameters passed onto \code{cutData} and
##'   \code{lattice:xyplot}. For example, \code{conditionalQuantile} passes the option
##'   \code{hemisphere = "southern"} on to \code{cutData} to provide southern
##'   (rather than default northern) hemisphere handling of \code{type = "season"}.
##'   Similarly, common axis and title labelling options (such as \code{xlab},
##'   \code{ylab}, \code{main}) are passed to \code{xyplot} via \code{quickText}
##'   to handle routine formatting.
##' @import latticeExtra purrr tidyr
##' @export
##' @author David Carslaw
##' @seealso See \code{\link{modStats}} for model evaluation statistics and the
##'   package \code{verification} for comprehensive functions for forecast
##'   verification.
##' @references
##'
##' Murphy, A. H., B.G. Brown and Y. Chen. (1989) Diagnostic
##' Verification of Temperature Forecasts, Weather and Forecasting,
##' Volume: 4, Issue: 4, Pages: 485-501.
##'
##' Wilks, D. S., 2005. Statistical Methods in the
##' Atmospheric Sciences, Volume 91, Second Edition (International
##' Geophysics), 2nd Edition. Academic Press.
##' @keywords methods
##' @examples
##'
##'
##' # load example data from package
##' data(mydata)
##'
##' ## make some dummy prediction data based on 'nox'
##' mydata$mod <- mydata$nox*1.1 + mydata$nox * runif(1:nrow(mydata)) ##' ##' # basic conditional quantile plot ##' ## A "perfect" model is shown by the blue line ##' ## predictions tend to be increasingly positively biased at high nox, ##' ## shown by departure of median line from the blue one. ##' ## The widening uncertainty bands with increasing NOx shows that ##' ## hourly predictions are worse for higher NOx concentrations. ##' ## Also, the red (median) line extends beyond the data (blue line), ##' ## which shows in this case some predictions are much higher than ##' ## the corresponding measurements. Note that the uncertainty bands ##' ## do not extend as far as the median line because there is insufficient ##' # to calculate them ##' conditionalQuantile(mydata, obs = "nox", mod = "mod") ##' ##' ## can split by season to show seasonal performance (not very ##' ## enlightening in this case - try some real data and it will be!) ##' ##' \dontrun{conditionalQuantile(mydata, obs = "nox", mod = "mod", type = "season")} ##' ##' ##' conditionalQuantile <- function(mydata, obs = "obs", mod = "mod", type = "default", bins = 31, min.bin = c(10, 20), xlab = "predicted value", ylab = "observed value", col = brewer.pal(5, "YlOrRd"), key.columns = 2, key.position = "bottom", auto.text = TRUE, ...) { ## partly based on from Wilks (2005) and package verification, with many modifications # keep R check quite data = second = third = NULL if (length(type) > 2) stop("Only two types can be used with this function") ## extra.args setup extra.args <- list(...) ## set graphics current.strip <- trellis.par.get("strip.background") current.font <- trellis.par.get("fontsize") ## reset graphic parameters on.exit(trellis.par.set( fontsize = current.font )) # label controls # (xlab and ylab handled in formals because unique action) extra.args$main <- if ("main" %in% names(extra.args)) {
quickText(extra.args$main, auto.text) } else { quickText("", auto.text) } if ("fontsize" %in% names(extra.args)) { trellis.par.set(fontsize = list(text = extra.args$fontsize))
}

if (length(col) == 1 && col == "greyscale") {
trellis.par.set(list(strip.background = list(col = "white")))
# other local colours
ideal.col <- "black"
col.1 <- grey(0.75)
col.2 <- grey(0.5)
col.5 <- grey(0.25)
} else {
ideal.col <- "#0080ff"
col.1 <- col[1]
col.2 <- col[2]
col.5 <- col[5]
}

vars <- c(mod, obs)

if (any(type %in% dateTypes)) vars <- c("date", vars)

## check the data
mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
mydata <- na.omit(mydata)
mydata <- cutData(mydata, type)

procData <- function(mydata) {

mydata <- select_if(mydata, is.numeric)
obs <- mydata[[obs]]
pred <- mydata[[mod]]
min.d <- min(mydata)
max.d <- max(mydata)
bins <- seq(floor(min.d), ceiling(max.d), length = bins)

lo <- min(bins)
hi <- max(bins)
b <- bins[-length(bins)]
labs <- b + 0.5 * diff(bins)
obs.cut <- cut(
obs,
breaks = bins, include.lowest = TRUE,
labels = labs
)
obs.cut[is.na(obs.cut)] <- labs[1]
obs.cut <- as.numeric(as.character(obs.cut))
pred.cut <- cut(
pred,
breaks = bins, include.lowest = TRUE,
labels = labs
)
pred.cut[is.na(pred.cut)] <- labs[1]

n <- length(labs)
lng <- tapply(obs, pred.cut, length)
med <- tapply(obs, pred.cut, median)
q1 <- tapply(obs, pred.cut, quantile, probs = 0.25)
q2 <- tapply(obs, pred.cut, quantile, probs = 0.75)
q1[lng <= min.bin[1]] <- NA
q2[lng <= min.bin[1]] <- NA
q3 <- tapply(obs, pred.cut, quantile, probs = 0.1)
q4 <- tapply(obs, pred.cut, quantile, probs = 0.9)
q3[lng <= min.bin[2]] <- NA
q4[lng <= min.bin[2]] <- NA

results <- data.frame(x = as.numeric(levels(pred.cut)), lng, med, q1, q2, q3, q4)

results.cut <- data.frame(pred.cut = as.numeric(as.character(pred.cut)), obs.cut = obs)

## range taken by observations
results.obs <- data.frame(min = min(obs), max = max(obs))
results <- list(results, results.cut, results.obs)
results
}

lo <- min(mydata[c(mod, obs)])
hi <- max(mydata[c(mod, obs)])

all.results <- group_by(mydata, UQS(syms(type))) %>%
group_nest() %>%
mutate(results = map(data, procData))

results <- all.results %>%
mutate(first = map(results, 1)) %>%
unnest(first)

hist.results <- all.results %>%
mutate(second = map(results, 2)) %>%
unnest(second)

obs.results <- all.results %>%
mutate(third = map(results, 3)) %>%
unnest(third)

## proper names of labelling #################################################
pol.name <- sapply(levels(results[[type[1]]]), function(x) quickText(x, auto.text))
strip <- strip.custom(factor.levels = pol.name)

if (length(type) == 1) {
strip.left <- FALSE
if (type == "default") strip <- FALSE
} else { ## two conditioning variables

pol.name <- sapply(levels(results[[type[2]]]), function(x) quickText(x, auto.text))
strip.left <- strip.custom(factor.levels = pol.name)
}
## ###########################################################################

temp <- paste(type, collapse = "+")
myform <- formula(paste("x ~ med | ", temp, sep = ""))

xyplot.args <- list(
x = myform, data = results,
xlim = c(lo, hi * 1.05),
ylim = c(lo, hi * 1.05),
ylab = quickText(ylab, auto.text),
xlab = quickText(xlab, auto.text),
as.table = TRUE,
aspect = 1,
strip = strip,
strip.left = strip.left,
key = list(
lines = list(
col = c(col.1, col.2, col.5, ideal.col),
lwd = c(15, 15, 2, 1)
),
lines.title = 1, title = "", text = list(lab = c(
"25/75th percentile",
"10/90th percentile",
"median",
"perfect model"
)),
space = key.position,
columns = key.columns
),
par.strip.text = list(cex = 0.8),
panel = function(x, subscripts, ...) {
panel.grid(-1, -1, col = "grey95")

poly.na(
results$x[subscripts], results$q3[subscripts],
results$x[subscripts], results$q4[subscripts],
myColors = col.2, alpha = 1
)
poly.na(
results$x[subscripts], results$q1[subscripts],
results$x[subscripts], results$q2[subscripts],
myColors = col.1, alpha = 1
)

# draw line of where observations lie
theSubset <- inner_join(obs.results, results[subscripts[1], ], by = type)

panel.lines(
c(theSubset$min, theSubset$max), c(
theSubset$min, theSubset$max
),
col = ideal.col, lwd = 1.5
)
panel.lines(
results$x[subscripts], results$med[subscripts],
col = col.5, lwd = 2
)
}
)

# reset for extra.args

xyplot.args <- listUpdate(xyplot.args, extra.args)

# plot
scatter <- do.call(xyplot, xyplot.args)

temp <- paste(type, collapse = "+")
myform <- formula(paste(" ~ pred.cut | ", temp, sep = ""))
bins <- seq(floor(lo), ceiling(hi), length = bins)

pred.cut <- NULL ## avoid R NOTES

histo <- histogram(
myform,
data = hist.results, breaks = bins, type = "count",
as.table = TRUE,
strip = strip,
strip.left = strip.left,
col = "black", alpha = 0.1, border = NA,
par.strip.text = list(cex = 0.8),
ylab = "sample size for histograms",
panel = function(x = pred.cut, col = "black", border = NA,
alpha = 0.2,
subscripts, ...) {
## histogram of observations
panel.histogram(
x = hist.results[["obs.cut"]][subscripts],
col = NA, alpha = 0.5, lwd = 0.5,
border = ideal.col, ...
)
## histogram of modelled values
panel.histogram(x = x, col = "black", border, alpha = 0.15, ...)
}
)

## supress scaling warnings
thePlot <- latticeExtra::doubleYScale(scatter, histo, add.ylab2 = TRUE)
thePlot <- update(thePlot, par.settings = simpleTheme(col = c("black", "black")))

if (length(type) == 1) {
plot(thePlot)
} else {
plot(latticeExtra::useOuterStrips(
thePlot,
strip = strip,
strip.left = strip.left
))
}

invisible(trellis.last.object())

output <- list(plot = thePlot, data = results, call = match.call())
class(output) <- "openair"
invisible(output)
}
`

Try the openair package in your browser

Any scripts or data that you put into this service are public.

openair documentation built on Nov. 22, 2021, 9:08 a.m.