##' Bivariate polar plot with smoothing
##'
##' Function for plotting pollutant concentration in polar coordinates
##' showing concentration by wind speed (or another numeric variable)
##' and direction. Mean concentrations are calculated for wind
##' speed-direction \sQuote{bins} (e.g. 0-1, 1-2 m/s,... and 0-10,
##' 10-20 degrees etc.). To aid interpretation, \code{gam} smoothing
##' is carried out using \code{mgcv}.
##'
##' The bivariate polar plot is a useful diagnostic tool for quickly
##' gaining an idea of potential sources. Wind speed is one of the
##' most useful variables to use to separate source types (see
##' references). For example, ground-level concentrations resulting
##' from buoyant plumes from chimney stacks tend to peak under higher
##' wind speed conditions. Conversely, ground-level, non-buoyant
##' plumes such as from road traffic, tend to have highest
##' concentrations under low wind speed conditions. Other sources such
##' as from aircraft engines also show differing characteristics by
##' wind speed.
##'
##' The function has been developed to allow variables other than wind
##' speed to be plotted with wind direction in polar coordinates. The
##' key issue is that the other variable plotted against wind
##' direction should be discriminating in some way. For example,
##' temperature can help reveal high-level sources brought down to
##' ground level in unstable atmospheric conditions, or show the
##' effect of a source emission that depends on temperature e.g.
##' biogenic isoprene.
##'
##' The plots can vary considerably depending on how much smoothing is
##' done. The approach adopted here is based on the very flexible and
##' capable \code{mgcv} package that uses \emph{Generalized Additive
##' Models}. While methods do exist to find an optimum level of
##' smoothness, they are not necessarily useful. The principal aim of
##' \code{polarPlot} is as a graphical analysis rather than for
##' quantitative purposes. In this respect the smoothing aims to
##' strike a balance between revealing interesting (real) features and
##' overly noisy data. The defaults used in \code{polarPlot} are based
##' on the analysis of data from many different sources. More advanced
##' users may wish to modify the code and adopt other smoothing
##' approaches.
##'
##' Various statistics are possible to consider e.g. mean, maximum,
##' median. \code{statistic = "max"} is often useful for revealing
##' sources.
##'
##' Wind direction is split up into 10 degree intervals and the other
##' variable (e.g. wind speed) into 30 intervals. These 2D bins are then
##' used to calculate the statistics.
##'
##' These plots often show interesting features at higher wind speeds
##' (see references below). For these conditions there can be very few
##' measurements and therefore greater uncertainty in the calculation
##' of the surface. There are several ways in which this issue can be
##' tackled. First, it is possible to avoid smoothing altogether and
##' use \code{\link{polarFreq}}. Second, the effect of setting a
##' minimum number of measurements in each wind speed-direction bin
##' can be examined through \code{min.bin}. It is possible that a
##' single point at high wind speed conditions can strongly affect the
##' surface prediction. Therefore, setting \code{min.bin = 3}, for
##' example, will remove all wind speed-direction bins with fewer than
##' 3 measurements \emph{before} fitting the surface. Third, consider
##' setting \code{uncertainty = TRUE}. This option will show the
##' predicted surface together with upper and lower 95\% confidence
##' intervals, which take account of the frequency of measurements.
##'
##' Variants on \code{polarPlot} include \code{polarAnnulus} and
##' \code{polarFreq}.
##'
##' @param mydata A data frame minimally containing \code{wd}, another
##' variable to plot in polar coordinates (the default is a column
##' \dQuote{ws} --- wind speed) and a pollutant. Should also contain
##' \code{date} if plots by time period are required.
##' @param pollutant Mandatory. A pollutant name corresponding to a
##' variable in a data frame should be supplied e.g. \code{pollutant
##' = "nox"}. There can also be more than one pollutant specified
##' e.g. \code{pollutant = c("nox", "no2")}. The main use of using
##' two or more pollutants is for model evaluation where two species
##' would be expected to have similar concentrations. This saves the
##' user stacking the data and it is possible to work with columns
##' of data directly. A typical use would be \code{pollutant =
##' c("obs", "mod")} to compare two columns \dQuote{obs} (the
##' observations) and \dQuote{mod} (modelled values).
##' @param x Name of variable to plot against wind direction in polar
##' coordinates, the default is wind speed, \dQuote{ws}.
##' @param wd Name of wind direction field.
##' @param type \code{type} determines how the data are split i.e.
##' conditioned, and then plotted. The default, \dQuote{default},
##' produces a single plot using the entire data. Type can be one of the
##' built-in types as detailed in \code{cutData} e.g.
##' \dQuote{season}, \dQuote{year}, \dQuote{weekday} and so on. For
##' example, \code{type = "season"} will produce four plots --- one
##' for each season.
##'
##' It is also possible to choose \code{type} as another variable in
##' the data frame. If that variable is numeric, then the data will
##' be split into four quantiles (if possible) and labelled
##' accordingly. If type is an existing character or factor
##' variable, then those categories/levels will be used directly.
##' This offers great flexibility for understanding the variation of
##' different variables and how they depend on one another.
##'
##' Type can be up to length two e.g. \code{type = c("season",
##' "weekday")} will produce a plot split by season and day of
##' the week. Note, when two types are provided the first forms the
##' columns and the second the rows.
##' @param statistic The statistic that should be applied to each wind
##' speed/direction bin. Can be \dQuote{mean} (default),
##' \dQuote{median}, \dQuote{max} (maximum), \dQuote{frequency},
##' \dQuote{stdev} (standard deviation), \dQuote{weighted.mean},
##' \dQuote{cpf} (Conditional Probability Function) or \dQuote{r}
##' (correlation coefficient). Because of the smoothing involved,
##' the colour scale for some of these statistics is only to provide
##' an indication of overall pattern and should not be interpreted
##' in concentration units e.g. for \code{statistic =
##' "weighted.mean"} where the bin mean is multiplied by the bin
##' frequency and divided by the total frequency. In many cases
##' using \code{polarFreq} will be better. Setting \code{statistic =
##' "weighted.mean"} can be useful because it provides an indication
##' of the concentration * frequency of occurrence and will
##' highlight the wind speed/direction conditions that dominate the
##' overall mean.
##'
##' When \code{statistic = "cpf"} the conditional probability
##' function (CPF) is plotted and a single (usually high) percentile
##' level is supplied. The CPF is defined as CPF = my/ny, where my
##' is the number of samples in the y bin (by default a wind
##' direction, wind speed interval) with mixing ratios greater than
##' the \emph{overall} percentile concentration, and ny is the total
##' number of samples in the same wind sector (see Ashbaugh et al.,
##' 1985). Note that percentile intervals can also be considered;
##' see \code{percentile} for details.
##'
##' When \code{statistic = "r"}, the Pearson correlation coefficient
##' is calculated for \emph{two} pollutants. The calculation
##' involves a weighted Pearson correlation coefficient, which is
##' weighted by Gaussian kernels for wind direction and the radial
##' variable (by default wind speed). More weight is assigned to
##' values close to a wind speed-direction interval. Kernel
##' weighting is used to ensure that all data are used rather than
##' relying on the potentially small number of values in a wind
##' speed-direction interval (see the examples below).
##' @param resolution Two plot resolutions can be set: \dQuote{normal}
##' (the default) and \dQuote{fine}, for a smoother plot. It should
##' be noted that plots with a \dQuote{fine} resolution can take
##' longer to render and the default option should be sufficient for
##' most circumstances.
##' @param limits The function does its best to choose sensible limits
##' automatically. However, there are circumstances when the user
##' will wish to set different ones. An example would be a series of
##' plots showing each year of data separately. The limits are set
##' in the form \code{c(lower, upper)}, so \code{limits = c(0, 100)}
##' would force the plot limits to span 0-100.
##' @param exclude.missing Setting this option to \code{TRUE} (the
##' default) removes points from the plot that are too far from the
##' original data. The smoothing routines will produce predictions
##' at points where no data exist i.e. they extrapolate. Removing the
##' points that are too far from the original data produces a plot
##' where it is clear where the original data lie. If set to \code{FALSE}
##' missing data will be interpolated.
##' @param uncertainty Should the uncertainty in the calculated
##' surface be shown? If \code{TRUE} three plots are produced on the
##' same scale showing the predicted surface together with the
##' estimated lower and upper uncertainties at the 95\% confidence
##' interval. Calculating the uncertainties is useful to understand
##' whether features are real or not. For example, at high wind
##' speeds where there are few data there is greater uncertainty
##' over the predicted values. The uncertainties are calculated
##' using the GAM and weighting is done by the frequency of
##' measurements in each wind speed-direction bin. Note that if
##' uncertainties are calculated then the type is set to "default".
##' @param percentile If \code{statistic = "percentile"} then
##' \code{percentile} is used, expressed from 0 to 100. Note that
##' the percentile value is calculated in the wind speed, wind
##' direction \sQuote{bins}. For this reason it can also be useful
##' to set \code{min.bin} to ensure there are a sufficient number of
##' points available to estimate a percentile. See \code{quantile}
##' for more details of how percentiles are calculated.
##'
##' \code{percentile} is also used for the Conditional Probability
##' Function (CPF) plots. \code{percentile} can be of length two, in
##' which case the percentile \emph{interval} is considered for use
##' with CPF. For example, \code{percentile = c(90, 100)} will plot
##' the CPF for concentrations between the 90 and 100th percentiles.
##' Percentile intervals can be useful for identifying specific
##' sources. In addition, \code{percentile} can also be of length 3.
##' The third value is the \sQuote{trim} value to be applied. When
##' calculating percentile intervals many can cover very low values
##' where there is no useful information. The trim value ensures
##' that values greater than or equal to the trim * mean value are
##' considered \emph{before} the percentile intervals are
##' calculated. The effect is to extract more detail from many
##' source signatures. See the manual for examples. Finally, if the
##' trim value is less than zero the percentile range is interpreted
##' as absolute concentration values and subsetting is carried out
##' directly.
##' @param cols Colours to be used for plotting. Options include
##' \dQuote{default}, \dQuote{increment}, \dQuote{heat},
##' \dQuote{jet} and \code{RColorBrewer} colours --- see the
##' \code{openair} \code{openColours} function for more details. For
##' user-defined schemes the user can supply a list of colour names
##' recognised by R (type \code{colours()} to see the full list). An
##' example would be \code{cols = c("yellow", "green", "blue")}.
##' @param weights At the edges of the plot there may only be a few
##' data points in each wind speed-direction interval, which could
##' in some situations distort the plot if the concentrations are
##' high. \code{weights} applies a weighting to reduce their
##' influence. For example, by default, if only a single data point
##' exists in a bin the weighting factor is 0.25, for two points it is
##' 0.5 and for three points 0.75. To not apply any weighting and use
##' the data as is, use \code{weights = c(1, 1, 1)}.
##'
##' As an alternative to down-weighting these points, they can be
##' removed altogether using \code{min.bin}.
##'
##' @param min.bin The minimum number of points allowed in a wind
##' speed/wind direction bin. The default is 1. A value of two
##' requires at least 2 valid records in each bin and so on; bins
##' with fewer than 2 valid records are set to NA. Care should be
##' taken when using a value > 1 because of the risk of removing
##' real data points. It is recommended to consider your data with
##' care. Also, the \code{polarFreq} function can be of use in such
##' circumstances.
##' @param mis.col When \code{min.bin} is > 1 it can be useful to show
##' where data are removed on the plots. This is done by shading the
##' missing data in \code{mis.col}. To not highlight missing data
##' when \code{min.bin} > 1 choose \code{mis.col = "transparent"}.
##' @param upper This sets the upper limit wind speed to be used.
##' Often there are relatively few data points at very high
##' wind speeds and plotting all of them can reduce the useful
##' information in the plot.
##' @param angle.scale The wind speed scale is by default shown at a
##' 315 degree angle. Sometimes the placement of the scale may
##' interfere with an interesting feature. The user can therefore
##' set \code{angle.scale} to another value (between 0 and 360
##' degrees) to mitigate such problems. For example
##' \code{angle.scale = 45} will draw the scale heading in a NE
##' direction.
##' @param units The units shown on the polar axis scale.
##' @param force.positive The default is \code{TRUE}. Sometimes if
##' smoothing data with steep gradients it is possible for predicted
##' values to be negative. \code{force.positive = TRUE} ensures that
##' predictions remain positive. This is useful for several reasons.
##' First, with lots of missing data more interpolation is needed
##' and this can result in artifacts because the predictions are too
##' far from the original data. Second, if it is known beforehand
##' that the data are all positive, then this option carries that
##' assumption through to the prediction. The only likely case where
##' setting \code{force.positive = FALSE} would be appropriate is if
##' background concentrations were first subtracted, resulting in data
##' that are legitimately negative. For the vast majority of situations it is
##' expected that the user will not need to alter the default
##' option.
##' @param k This is the smoothing parameter used by the \code{gam}
##' function in package \code{mgcv}. Typically, a value of around 100
##' (the default) seems to be suitable and will resolve important
##' features in the plot. The most appropriate choice of \code{k} is
##' problem-dependent, but extensive testing of polar plots for many
##' different problems suggests a value of \code{k} of about 100 is
##' suitable. Setting \code{k} to higher values will not tend to
##' affect the surface predictions by much but will add to the
##' computation time. Lower values of \code{k} will increase
##' smoothing. Sometimes with few data to plot \code{polarPlot} will
##' fail. Under these circumstances it can be worth lowering the
##' value of \code{k}.
##' @param normalise If \code{TRUE} concentrations are normalised by
##' dividing by their mean value. This is done \emph{after} fitting
##' the smooth surface. This option is particularly useful if one is
##' interested in the patterns of concentrations for several
##' pollutants on different scales e.g. NOx and CO. Often useful if
##' more than one \code{pollutant} is chosen.
##' @param key.header Adds additional text/labels to the scale key.
##' For example, passing the options \code{key.header = "header",
##' key.footer = "footer1"} adds addition text above and below the
##' scale key. These arguments are passed to \code{drawOpenKey} via
##' \code{quickText}, applying the \code{auto.text} argument, to
##' handle formatting.
##' @param key.footer see \code{key.header}.
##' @param key.position Location where the scale key is to be plotted.
##' Allowed arguments currently include \code{"top"},
##' \code{"right"}, \code{"bottom"} and \code{"left"}.
##' @param key Fine control of the scale key via \code{drawOpenKey}.
##' See \code{drawOpenKey} for further details.
##' @param auto.text Either \code{TRUE} (default) or \code{FALSE}. If
##' \code{TRUE} titles and axis labels will automatically try and
##' format pollutant names and units properly e.g. by subscripting
##' the `2' in NO2.
##' @param ... Other graphical parameters passed onto
##' \code{lattice::levelplot} and \code{cutData}. For example,
##' \code{polarPlot} passes the option \code{hemisphere =
##' "southern"} on to \code{cutData} to provide southern (rather
##' than default northern) hemisphere handling of \code{type =
##' "season"}. Similarly, common axis and title labelling options
##' (such as \code{xlab}, \code{ylab}, \code{main}) are passed to
##' \code{levelplot} via \code{quickText} to handle routine
##' formatting.
##' @export
##' @import tidyr
##' @import mgcv
##' @return As well as generating the plot itself, \code{polarPlot}
##' also returns an object of class ``openair''. The object includes
##' three main components: \code{call}, the command used to generate
##' the plot; \code{data}, the data frame of summarised information
##' used to make the plot; and \code{plot}, the plot itself. If
##' retained, e.g. using \code{output <- polarPlot(mydata, "nox")},
##' this output can be used to recover the data, reproduce or rework
##' the original plot or undertake further analysis.
##'
##' An openair output can be manipulated using a number of generic
##' operations, including \code{print}, \code{plot} and
##' \code{summary}.
##'
##' \code{polarPlot} surface data can also be extracted directly
##' using the \code{results} function, e.g. \code{results(object)} for
##' \code{output <- polarPlot(mydata, "nox")}. This returns a data
##' frame with four set columns: \code{cond}, conditioning based on
##' \code{type}; \code{u} and \code{v}, the translational vectors
##' based on \code{ws} and \code{wd}; and the local \code{pollutant}
##' estimate.
##' @author David Carslaw
##' @seealso \code{\link{polarCluster}} for identifying features in
##' bivariate polar plots and for post processing, and
##' \code{\link{polarAnnulus}}, \code{\link{polarFreq}},
##' \code{\link{percentileRose}} for other ways of plotting
##' directional data.
##' @references
##'
##' Ashbaugh, L.L., Malm, W.C., Sadeh, W.Z., 1985. A residence time
##' probability analysis of sulfur concentrations at Grand Canyon
##' National Park. Atmospheric Environment 19 (8), 1263-1270.
##'
##' Carslaw, D.C., Beevers, S.D., Ropkins, K. and M.C. Bell (2006).
##' Detecting and quantifying aircraft and other on-airport
##' contributions to ambient nitrogen oxides in the vicinity of a
##' large international airport. Atmospheric Environment. 40/28 pp
##' 5424-5434.
##'
##' Carslaw, D.C., & Beevers, S.D. (2013). Characterising and
##' understanding emission sources using bivariate polar plots and
##' k-means clustering. Environmental Modelling & Software, 40,
##' 325-329. doi:10.1016/j.envsoft.2012.09.005
##'
##' Henry, R.C., Chang, Y.S., Spiegelman, C.H., 2002. Locating nearby
##' sources of air pollution by nonparametric regression of
##' atmospheric concentrations on wind direction. Atmospheric
##' Environment 36 (13), 2237-2244.
##'
##' Westmoreland, E.J., N. Carslaw, D.C. Carslaw, A. Gillah and E.
##' Bates (2007). Analysis of air quality within a street canyon
##' using statistical and dispersion modelling techniques. Atmospheric
##' Environment. Vol. 41(39), pp. 9195-9205.
##'
##' Yu, K.N., Cheung, Y.P., Cheung, T., Henry, R.C., 2004. Identifying
##' the impact of large urban airports on local air quality by
##' nonparametric regression. Atmospheric Environment 38 (27),
##' 4501-4507.
##' @keywords methods
##' @examples
##'
##'
##' # load example data from package
##' data(mydata)
##'
##' # basic plot
##' polarPlot(mydata, pollutant = "nox")
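##'
##' # statistic = "max" is often useful for revealing sources
##' \dontrun{polarPlot(mydata, pollutant = "nox", statistic = "max")}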
##'
##' # polarPlots by year on same scale
##' \dontrun{polarPlot(mydata, pollutant = "so2", type = "year", main = "polarPlot of so2")}
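##'
##' # as above but with user-set limits so the yearly plots share a colour scale
##' \dontrun{polarPlot(mydata, pollutant = "nox", type = "year", limits = c(0, 100))}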
##'
##' # set minimum number of bins to be used to see if pattern remains similar
##' \dontrun{polarPlot(mydata, pollutant = "nox", min.bin = 3)}
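##'
##' # turn off the default down-weighting of sparse bins and use the data as is
##' \dontrun{polarPlot(mydata, pollutant = "nox", weights = c(1, 1, 1))}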
##'
##' # plot by day of the week
##'
##' \dontrun{polarPlot(mydata, pollutant = "pm10", type = "weekday")}
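##'
##' # two types: columns split by season, rows by day of the week
##' \dontrun{polarPlot(mydata, pollutant = "pm10", type = c("season", "weekday"))}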
##'
##' # show the 95\% confidence intervals in the surface fitting
##' \dontrun{polarPlot(mydata, pollutant = "so2", uncertainty = TRUE)}
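##'
##' # conditional probability function (CPF) at a single (high) percentile, e.g. 95
##' \dontrun{polarPlot(mydata, pollutant = "so2", statistic = "cpf", percentile = 95)}
##'
##' # CPF for a percentile interval, e.g. the 90th to 100th percentiles
##' \dontrun{polarPlot(mydata, pollutant = "so2", statistic = "cpf",
##'                    percentile = c(90, 100))}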
##'
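##' # weighted correlation between two pollutants
##' \dontrun{polarPlot(mydata, pollutant = c("nox", "no2"), statistic = "r")}
##'
##' # weighted mean to highlight conditions that dominate overall concentrations
##' \dontrun{polarPlot(mydata, pollutant = "nox", statistic = "weighted.mean")}
##'
##' # compare patterns of two pollutants on different scales using normalised surfaces
##' \dontrun{polarPlot(mydata, pollutant = c("nox", "co"), normalise = TRUE)}
##'
##' # keep the output to extract the surface data with results()
##' \dontrun{
##' output <- polarPlot(mydata, pollutant = "nox")
##' results(output)
##' }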
##'
polarPlot <-
function(mydata, pollutant = "nox", x = "ws", wd = "wd",
type = "default", statistic = "mean", resolution = "normal",
limits = NA, exclude.missing = TRUE, uncertainty = FALSE,
percentile = NA, cols = "default", weights = c(0.25, 0.5, 0.75),
min.bin = 1, mis.col = "grey", upper = NA, angle.scale = 315,
units = x, force.positive = TRUE, k = 100, normalise = FALSE,
key.header = "", key.footer = pollutant, key.position = "right",
key = TRUE, auto.text = TRUE, ...)
{
## get rid of R check annoyances
z <- NULL
if (statistic == "percentile" && is.na(percentile[1])) {
warning("percentile value missing, using 50")
percentile <- 50
}
if (statistic == "r" && length(pollutant) != 2)
stop("Correlation statistic requires two pollutants.")
# names of variables for use later
nam.x <- x
nam.wd <- wd
if (length(type) > 2) {stop("Maximum number of types is 2.")}
if (uncertainty) type <- "default" ## can't have conditioning here
if (uncertainty & length(pollutant) > 1)
stop("Can only have one pollutant when uncertainty = TRUE")
if (!statistic %in% c("mean", "median", "frequency", "max", "stdev",
"weighted.mean", "percentile", "cpf", "r"))
stop (paste("statistic '", statistic, "' not recognised", sep = ""))
if (length(weights) != 3) stop ("weights should be of length 3.")
if (missing(key.header)) key.header <- statistic
if (key.header == "weighted.mean") key.header <- "weighted\nmean"
if (key.header == "percentile")
key.header <- c(paste(percentile, "th", sep = ""), "percentile")
if ("cpf" %in% key.header) key.header <- c("CPF", "probability")
## extra.args setup
extra.args <- list(...)
## label controls
extra.args$xlab <- if ("xlab" %in% names(extra.args))
quickText(extra.args$xlab, auto.text) else quickText("", auto.text)
extra.args$ylab <- if ("ylab" %in% names(extra.args))
quickText(extra.args$ylab, auto.text) else quickText("", auto.text)
extra.args$main <- if ("main" %in% names(extra.args))
quickText(extra.args$main, auto.text) else quickText("", auto.text)
## layout default
if (!"layout" %in% names(extra.args))
extra.args$layout <- NULL
## extract variables of interest
vars <- c(wd, x, pollutant)
if (any(type %in% dateTypes)) vars <- c(vars, "date")
mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
mydata <- na.omit(mydata)
## this is used later for the 'x' scale
min.scale <- min(mydata[[x]], na.rm = TRUE)
# scale data by subtracting the min value; this helps when dealing
# with negative data on the radial axis (starts from zero, always
# positive)
mydata[[x]] <- mydata[[x]] - min(mydata[[x]], na.rm = TRUE)
# if more than one pollutant, need to stack the data and set type =
# "variable" this case is most relevent for model-measurement
# compasrions where data are in columns Can also do more than one
# pollutant and a single type that is not "default", in which case
# pollutant becomes a conditioning variable
# if statistic is 'r' we need the data for two pollutants in columns
if (length(pollutant) > 1 && statistic != "r") {
if (length(type) > 1) {
warning(paste("Only type = '", type[1], "' will be used", sep = ""))
type <- type[1]
}
## use pollutants as conditioning variables
mydata <- gather(mydata, key = variable, value = value, one_of(pollutant))
## now set pollutant to "value"
pollutant <- "value"
if (type == "default") {
type <- "variable"
} else {
type <- c(type, "variable")
}
}
## cutData depending on type
mydata <- cutData(mydata, type, ...)
## if upper ws not set, set it to the max to display all information
max.ws <- max(mydata[[x]], na.rm = TRUE)
min.ws <- min(mydata[[x]], na.rm = TRUE)
clip <- TRUE ## used for removing data where ws > upper
if (missing(upper)) {
upper <- max.ws
clip <- FALSE
}
## for resolution of grid plotting (normal = 101; fine = 201)
if (resolution == "normal") int <- 101
if (resolution == "fine") int <- 201
if (resolution == "ultra.fine") int <- 401 ## very large files!
## binning wd data properly
## use 10 degree binning of wd if already binned, else 5
if (all(mydata[[wd]] %% 10 == 0, na.rm = TRUE)) {
wd.int <- 10
} else {
wd.int <- 5 ## how to split wd
}
ws.seq <- seq(min.ws, max.ws, length = 30)
wd.seq <- seq(from = wd.int, to = 360, by = wd.int) ## wind directions from wd.int to 360
ws.wd <- expand.grid(x = ws.seq, wd = wd.seq)
u <- with(ws.wd, x * sin(pi * wd / 180)) ## convert to polar coords
v <- with(ws.wd, x * cos(pi * wd / 180))
## data to predict over
input.data <- expand.grid(u = seq(-upper, upper, length = int),
v = seq(-upper, upper, length = int))
if (statistic == "cpf") {
## can be interval of percentiles or a single (threshold)
if (length(percentile) > 1) {
statistic <- "cpfi" # CPF interval
if (length(percentile) == 3) {
## in this case there is a trim value as a proportion of the mean
## if this value <0 use absolute values as range
Mean <- mean(mydata[[pollutant]], na.rm = TRUE)
if (percentile[3] < 0) {
Pval <- percentile[1:2]
} else {
Pval <- quantile(subset(mydata[[pollutant]],
mydata[[pollutant]] >= Mean *
percentile[3]),
probs = percentile[1:2] / 100,
na.rm = TRUE)
}
} else {
Pval <- quantile(mydata[[pollutant]],
probs = percentile / 100, na.rm = TRUE)
}
sub <- paste("CPF (", format(Pval[1], digits = 2), " to ",
format(Pval[2], digits = 2), ")", sep = "")
} else {
Pval <- quantile(mydata[[pollutant]], probs = percentile / 100, na.rm = TRUE)
sub <- paste("CPF at the ", percentile,
"th percentile (=", format(Pval, digits = 2), ")", sep = "")
}
} else {
sub <- NULL
}
prepare.grid <- function(mydata) {
## identify which ws and wd bins the data belong to
wd <- cut(wd.int * ceiling(mydata[[wd]] / wd.int - 0.5),
breaks = seq(0, 360, wd.int), include.lowest = TRUE)
x <- cut(mydata[[x]], breaks = seq(0, max.ws, length = 31),
include.lowest = TRUE)
if (statistic != "r") {
binned <- switch(
statistic,
frequency = tapply(mydata[[pollutant]], list(wd, x), function(x)
length(na.omit(x))),
mean = tapply(mydata[[pollutant]], list(wd, x), function(x)
mean(x, na.rm = TRUE)),
median = tapply(mydata[[pollutant]], list(wd, x), function(x)
median(x, na.rm = TRUE)),
max = tapply(mydata[[pollutant]], list(wd, x), function(x)
max(x, na.rm = TRUE)),
stdev = tapply(mydata[[pollutant]], list(wd, x), function(x)
sd(x, na.rm = TRUE)),
cpf = tapply(mydata[[pollutant]], list(wd, x),
function(x)
(length(which(
x > Pval
)) / length(x))),
cpfi = tapply(mydata[[pollutant]], list(wd, x),
function(x)
(length(
which(x > Pval[1] & x <= Pval[2])
) / length(x))),
weighted.mean = tapply(mydata[[pollutant]], list(wd, x),
function(x)
(mean(x) * length(x) / nrow(mydata))),
percentile = tapply(mydata[[pollutant]], list(wd, x), function(x)
quantile(x, probs = percentile / 100, na.rm = TRUE))
)
binned <- as.vector(t(binned))
# statistic is the weighted correlation coefficient
} else {
binned <- rowwise(ws.wd) %>%
do(calCor(., mydata, x = nam.x, y = nam.wd,
pol_1 = pollutant[1], pol_2 = pollutant[2]))
binned <- binned$r
}
## frequency - remove points with freq < min.bin
bin.len <- tapply(mydata[[pollutant[1]]], list(x, wd), length)
binned.len <- as.vector(bin.len)
## apply weights
W <- rep(1, times = length(binned))
ids <- which(binned.len == 1)
W[ids] <- W[ids] * weights[1]
ids <- which(binned.len == 2)
W[ids] <- W[ids] * weights[2]
ids <- which(binned.len == 3)
W[ids] <- W[ids] * weights[3]
## set missing to NA
ids <- which(binned.len < min.bin)
binned[ids] <- NA
# for removing missing data later
binned.len[ids] <- NA
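## force.positive works by fitting the GAM to binned^0.5 and squaring the
## predictions, which keeps the fitted surface non-negative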
if (force.positive) n <- 0.5 else n <- 1
## no uncertainty to calculate
if (!uncertainty) {
## catch errors when not enough data to calculate surface
Mgam <- try(gam(binned ^ n ~ s(u, v, k = k), weights = W), TRUE)
if (!inherits(Mgam, "try-error")) {
pred <- predict.gam(Mgam, input.data)
pred <- pred ^ (1 / n)
pred <- as.vector(pred)
results <- data.frame(u = input.data$u, v = input.data$v, z = pred)
} else {
results <- data.frame(u = u, v = v, z = binned)
exclude.missing <- FALSE
warning(call. = FALSE, paste("Not enough data to fit surface.\nTry reducing the value of the smoothing parameter, k to less than ", k, ".", sep = ""))
}
} else {
## uncertainties calculated, weighted by number of points in each bin
Mgam <- gam(binned ^ n ~ s(u, v, k = k), weights = binned.len)
pred <- predict.gam(Mgam, input.data, se.fit = TRUE)
uncer <- 2 * as.vector(pred[[2]]) ## for approx 95% CI
pred <- as.vector(pred[[1]]) ^ (1 / n)
## do not weight for central prediction
Mgam <- gam(binned ^ n ~ s(u, v, k = k))
pred <- predict.gam(Mgam, input.data)
pred <- as.vector(pred)
Lower <- (pred - uncer) ^ (1 / n)
Upper <- (pred + uncer) ^ (1 / n)
pred <- pred ^ (1 / n)
n <- length(pred)
results <-
data.frame(u = rep(input.data$u, 3), v = rep(input.data$v, 3),
z = c(pred, Lower, Upper),
uncertainty = rep(c("prediction", "lower uncertainty",
"upper uncertainty"), each = n))
}
## function to remove points too far from original data
exclude <- function(results) {
## exclude predictions too far from data (from mgcv)
x <- seq(-upper, upper, length = int)
y <- x
res <- int
wsp <- rep(x, res)
wdp <- rep(y, rep(res, res))
## data with gaps caused by min.bin
all.data <- na.omit(data.frame(u, v, binned.len))
ind <- with(all.data, exclude.too.far(wsp, wdp, u, v, dist = 0.05))
results$z[ind] <- NA
results
}
if (exclude.missing) results <- exclude(results)
results
}
## if min.bin >1 show the missing data. Work this out by running twice:
## first time with no missings, second with min.bin.
## Plot one surface on top of the other.
if (!missing(min.bin)) {
tmp <- min.bin
min.bin <- 0
res1 <- group_by_(mydata, .dots = type) %>%
do(prepare.grid(.))
min.bin <- tmp
res <- group_by_(mydata, .dots = type) %>%
do(prepare.grid(.))
res$miss <- res1$z
} else {
res <- group_by_(mydata, .dots = type) %>%
do(prepare.grid(.))
}
## with CPF make sure not >1 due to surface fitting
if (any(res$z > 1, na.rm = TRUE) & statistic %in% c("cpf", "cpfi")) {
id <- which(res$z > 1)
res$z[id] <- 1
}
## remove wind speeds > upper to make a circle
if (clip) res$z[(res$u ^ 2 + res$v ^ 2) ^ 0.5 > upper] <- NA
if (uncertainty) type <- "uncertainty"
## proper names of labelling
strip.dat <- strip.fun(res, type, auto.text)
pol.name <- strip.dat[[1]]
pol.name2 <- strip.dat[[2]]
# proper names for strip labels
levels(res[[type[1]]]) <- pol.name
if (length(type) == 2L) levels(res[[type[2]]]) <- pol.name2
## normalise by dividing by mean conditioning value if needed
if (normalise){
res <- mutate(res, z = z / mean(z, na.rm = TRUE))
if (missing(key.footer)) key.footer <- "normalised \nlevel"
}
# correlation notation
if (statistic == "r") {
if (missing(key.footer))
key.footer <-
paste0("corr(", pollutant[1], ", ", pollutant[2], ")")
# make sure smoothing does not result in r > 1 or < -1
# sometimes happens with little data at edges
id <- which(res$z > 1)
if (length(id) > 0) res$z[id] <- 1
id <- which(res$z < -1)
if (length(id) > 0) res$z[id] <- -1
}
## auto-scaling
nlev <- 200 ## preferred number of intervals
## handle missing breaks arguments
if (missing(limits)) {
# breaks <- pretty(res$z, n = nlev)
breaks <- seq(min(res$z, na.rm = TRUE), max(res$z, na.rm = TRUE),
length.out = nlev)
} else {
## handle user limits and clipping
breaks <- seq(min(limits), max(limits), length.out = nlev)
## case where user max is < data max
if (max(limits) < max(res[["z"]], na.rm = TRUE)) {
id <- which(res[["z"]] > max(limits))
res[["z"]][id] <- max(limits)
}
## case where user min is > data min
if (min(limits) > min(res[["z"]], na.rm = TRUE)) {
id <- which(res[["z"]] < min(limits))
res[["z"]][id] <- min(limits)
}
}
nlev2 <- length(breaks)
col <- openColours(cols, (nlev2 - 1))
col.scale <- breaks
## special handling of layout for uncertainty
if (uncertainty & is.null(extra.args$layout)) {
extra.args$layout <- c(3, 1)
}
## scaling
## scaling of 'zeroed' data
## note - add upper because user can set this to be different to data
intervals <- pretty(c(mydata[[x]], upper))
## labels for scaling
labels <- pretty(c(mydata[[x]], upper) + min.scale)
## offset the lines/labels if necessary
intervals <- intervals + (min(labels) - min.scale)
## add zero in the middle if it exists
if (min.scale != 0){
labels <- labels[-1]
intervals <- intervals[-1]
}
# if user does not set limits, this just results in data limits being used
if (length(limits) == 1 && is.na(limits[1]))
limits <- c(NA, NA)
angles <- seq(0, 2 * pi, length = 360)
themax <- upper
datapoly <- data.frame(x = themax * sin(angles),
y = themax * cos(angles))
datalines <- data.frame(value = rep(intervals, each = 360),
x = rep(intervals, each = 360) * sin(angles),
y = rep(intervals, each = 360) * cos(angles))
data.axes <- data.frame(x1 = c(-upper, 0), y1 = c(0, -upper),
x2 = c(upper, 0), y2 = c(0, upper))
lab.interval <- intervals[-length(intervals)]
data.scale <- data.frame(x = 1.07 * lab.interval * sin(pi * angle.scale / 180),
y = 1.07 * lab.interval * cos(pi * angle.scale / 180),
label = labels[-length(labels)])
# build the polar plot
plt <- ggplot(res, aes(x = u, y = v)) +
theme_void() +
coord_cartesian(xlim = c(-upper * 1.025, upper * 1.025),
ylim = c(-upper * 1.025, upper * 1.025)) +
geom_polygon(data = datapoly, aes(x = x, y = y), color = "grey95",
fill = "grey95")
if ("miss" %in% names(res))
plt <- plt + geom_raster(data = filter(res, !is.na(miss)), aes(x = u, y = v),
fill = mis.col,
inherit.aes = FALSE)
plt <- plt +
geom_raster(na.rm = FALSE, aes(fill = z)) +
geom_path(data = datalines, aes(x = x, y = y), color = "white") +
geom_segment(data = data.axes, aes(x = x1, y = y1, xend = x2, yend = y2),
color = "white") +
geom_text(data = data.scale, aes(x = x, y = y, label = label),
hjust = "left", size = 4) +
annotate("text", upper * -1 * 0.95, 0, label = "W") +
annotate("text", 0, upper * -1 * 0.95, label = "S") +
annotate("text", 0, upper * 0.95, label = "N") +
annotate("text", upper * 0.95, 0, label = "E") +
scale_fill_gradientn(colours = openColours(cols, 100),
na.value = "transparent", limits = limits) +
guides(fill = guide_colourbar(title = quickText(key.footer),
barheight = unit(0.5, "npc"))) +
theme(aspect.ratio = 1, strip.background = element_rect(fill = "grey"))
if (!"default" %in% type) {
if (length(type) == 1L)
plt <- plt + facet_wrap(reformulate(type), labeller = label_parsed)
if (length(type) == 2L)
plt <- plt + facet_grid(paste(type[2], "~", type[1]), labeller = label_parsed)
}
plot(plt)
newdata <- res
output <- list(plot = plt, data = newdata, call = match.call())
class(output) <- "openair"
invisible(output)
}
calCor <- function(data, mydata, x = "ws", y = "wd", pol_1 = "nox",
pol_2 = "pm10") {
# function to work out weighted correlation between two surfaces
# uses Gaussian kernel smoothing to weight wd/ws
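# each observation is weighted by the product of two Gaussian kernels: one
# on the scaled distance of its wind speed from the bin centre and one on
# the scaled angular distance of its wind direction, so all data contribute
# but nearby observations dominate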
ws1 <- data[[1]] # centre of ws
wd1 <- data[[2]] # centre of wd
# scale ws, note 20 gives sensible spread for kernel
mydata$ws.scale <- 20 * (mydata[[x]] - ws1) /
(max(mydata[[x]], na.rm = TRUE) - min(mydata[[x]], na.rm = TRUE))
# Gaussian kernel
mydata$ws.scale <- (2 * pi) ^ -0.5 * exp(-0.5 * mydata$ws.scale ^ 2)
# wd direction scaling, use Gaussian kernel
mydata$wd.scale <- mydata[[y]] - wd1
id <- which(mydata$wd.scale < 0)
if (length(id) > 0) mydata$wd.scale[id] <- mydata$wd.scale[id] + 360
id <- which(mydata$wd.scale > 180)
if (length(id) > 0) mydata$wd.scale[id] <- abs(mydata$wd.scale[id] - 360)
# note, 4 gives a sensible spread for kernel
mydata$wd.scale <- 4 * mydata$wd.scale * 2 * pi / 360
mydata$wd.scale <- (2 * pi) ^ -0.5 * exp(-0.5 * mydata$wd.scale ^ 2)
# final weighting multiplies two kernels for ws and wd
mydata$weight <- mydata$ws.scale * mydata$wd.scale
mydata$weight <- mydata$weight / max(mydata$weight, na.rm = TRUE)
# the data we need
thedata <- select_(mydata, pol_1, pol_2, "weight")
thedata <- na.omit(thedata)
# useful for showing what the weighting looks like as a surface
# scatterPlot(mydata, x= "ws", y = "wd", z = "weight", method = "level", col = "jet")
# call weighted Pearson correlation
r <- corr(cbind(thedata[[pol_1]], thedata[[pol_2]]),
w = thedata$weight)
result <- data.frame(ws1, wd1, r)
return(result)
}
# taken directly from the boot package to save importing
corr <- function(d, w = rep(1, nrow(d)) / nrow(d))
{
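# weighted Pearson correlation: weighted covariance of the two columns
# divided by the product of their weighted standard deviations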
s <- sum(w)
m1 <- sum(d[, 1L] * w) / s
m2 <- sum(d[, 2L] * w) / s
(sum(d[, 1L] * d[, 2L] * w) / s - m1 * m2) /
sqrt((sum(d[, 1L] ^ 2 * w) / s - m1 ^ 2) *
(sum(d[, 2L] ^ 2 * w) / s - m2 ^ 2))
}