#' Function for plotting bivariate polar plots with smoothing.
#'
#' Function for plotting pollutant concentration in polar coordinates showing
#' concentration by wind speed (or another numeric variable) and direction. Mean
#' concentrations are calculated for wind speed-direction \sQuote{bins} (e.g.
#' 0-1, 1-2 m/s,... and 0-10, 10-20 degrees etc.). To aid interpretation,
#' \code{gam} smoothing is carried out using \code{mgcv}.
#'
#' The bivariate polar plot is a useful diagnostic tool for quickly gaining an
#' idea of potential sources. Wind speed is one of the most useful variables to
#' use to separate source types (see references). For example, ground-level
#' concentrations resulting from buoyant plumes from chimney stacks tend to peak
#' under higher wind speed conditions. Conversely, ground-level, non-buoyant
#' plumes such as from road traffic, tend to have highest concentrations under
#' low wind speed conditions. Other sources such as from aircraft engines also
#' show differing characteristics by wind speed.
#'
#' The function has been developed to allow variables other than wind speed to
#' be plotted with wind direction in polar coordinates. The key issue is that
#' the other variable plotted against wind direction should be discriminating in
#' some way. For example, temperature can help reveal high-level sources brought
#' down to ground level in unstable atmospheric conditions, or show the effect
#' of a source whose emissions depend on temperature, e.g. biogenic isoprene.
#'
#' The plots can vary considerably depending on how much smoothing is done. The
#' approach adopted here is based on the very flexible and capable \code{mgcv}
#' package that uses \emph{Generalized Additive Models}. While methods do exist
#' to find an optimum level of smoothness, they are not necessarily useful. The
#' principal aim of \code{polarPlot} is as a graphical analysis rather than for
#' quantitative purposes. In this respect the smoothing aims to strike a balance
#' between revealing interesting (real) features and overly noisy data. The
#' defaults used in [polarPlot()] are based on the analysis of data from many
#' different sources. More advanced users may wish to modify the code and adopt
#' other smoothing approaches.
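#'
#' For example, a lower value of \code{k} gives a smoother surface (a minimal
#' sketch using the bundled \code{mydata}; the default \code{k = 100} is
#' usually appropriate):
#'
#' ```r
#' # default smoothing
#' polarPlot(openair::mydata, pollutant = "nox")
#'
#' # a noticeably smoother surface
#' polarPlot(openair::mydata, pollutant = "nox", k = 50)
#' ```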
#'
#' Various statistics can be considered, e.g. mean, maximum, median.
#' \code{statistic = "max"} is often useful for revealing sources. Pair-wise
#' statistics between two pollutants can also be calculated.
#'
#' The function can also be used to compare two pollutant species through a
#' range of pair-wise statistics (see help on \code{statistic}) and Grange et
#' al. (2016) (open-access publication link below).
#'
#' Wind direction is split up into 10 degree intervals and the other variable
#' (e.g. wind speed) into 30 intervals. These 2D bins are then used to
#' calculate the statistics.
#'
#' These plots often show interesting features at higher wind speeds (see
#' references below). For these conditions there can be very few measurements
#' and therefore greater uncertainty in the calculation of the surface. There
#' are several ways in which this issue can be tackled. First, it is possible to
#' avoid smoothing altogether and use [polarFreq()]. Second, the effect of
#' setting a minimum number of measurements in each wind speed-direction bin can
#' be examined through \code{min.bin}. It is possible that a single point at
#' high wind speed conditions can strongly affect the surface prediction.
#' Therefore, setting \code{min.bin = 3}, for example, will remove all wind
#' speed-direction bins with fewer than 3 measurements \emph{before} fitting the
#' surface. Third, consider setting \code{uncertainty = TRUE}. This option will
#' show the predicted surface together with upper and lower 95% confidence
#' intervals, which take account of the frequency of measurements.
#'
#' Variants on \code{polarPlot} include [polarAnnulus()] and [polarFreq()].
#'
#' @param mydata A data frame minimally containing \code{wd}, another variable
#' to plot in polar coordinates (the default is a column \dQuote{ws} --- wind
#' speed) and a pollutant. Should also contain \code{date} if plots by time
#' period are required.
#'
#' @param pollutant Mandatory. A pollutant name corresponding to a variable in a
#' data frame should be supplied e.g. \code{pollutant = "nox"}. There can also
#' be more than one pollutant specified e.g. \code{pollutant = c("nox",
#' "no2")}. The main use of using two or more pollutants is for model
#' evaluation where two species would be expected to have similar
#' concentrations. This saves the user stacking the data and it is possible to
#' work with columns of data directly. A typical use would be \code{pollutant
#' = c("obs", "mod")} to compare two columns \dQuote{obs} (the observations)
#' and \dQuote{mod} (modelled values). When pair-wise statistics such as
#' Pearson correlation and regression techniques are to be plotted,
#' \code{pollutant} takes two elements too. For example, \code{pollutant =
#' c("bc", "pm25")} where \code{"bc"} is a function of \code{"pm25"}.
#'
#' @param x Name of variable to plot against wind direction in polar
#' coordinates, the default is wind speed, \dQuote{ws}.
#'
#' @param wd Name of wind direction field.
#'
#' @param type \code{type} determines how the data are split i.e. conditioned,
#' and then plotted. The default, \dQuote{default}, produces a single plot using the
#' entire data. Type can be one of the built-in types as detailed in
#' \code{cutData} e.g. \dQuote{season}, \dQuote{year}, \dQuote{weekday} and so
#' on. For example, \code{type = "season"} will produce four plots --- one for
#' each season.
#'
#' It is also possible to choose \code{type} as another variable in the data
#' frame. If that variable is numeric, then the data will be split into four
#' quantiles (if possible) and labelled accordingly. If type is an existing
#' character or factor variable, then those categories/levels will be used
#' directly. This offers great flexibility for understanding the variation of
#' different variables and how they depend on one another.
#'
#' Type can be up to length two e.g. \code{type = c("season", "weekday")} will
#' produce a plot split by season and day of the week. Note, when two
#' types are provided the first forms the columns and the second the rows.
#'
#' @param statistic The statistic that should be applied to each wind
#' speed/direction bin. Because of the smoothing involved, the colour scale
#' for some of these statistics is only to provide an indication of overall
#' pattern and should not be interpreted in concentration units e.g. for
#' \code{statistic = "weighted.mean"} where the bin mean is multiplied by the
#' bin frequency and divided by the total frequency. In many cases using
#' \code{polarFreq} will be better. Setting \code{statistic = "weighted.mean"}
#' can be useful because it provides an indication of the concentration *
#' frequency of occurrence and will highlight the wind speed/direction
#' conditions that dominate the overall mean. Can be:
#'
#' \itemize{ \item \dQuote{mean} (default), \dQuote{median}, \dQuote{max}
#' (maximum), \dQuote{frequency}, \dQuote{stdev} (standard deviation),
#' \dQuote{weighted.mean}.
#'
#' \item \code{statistic = "nwr"} Implements the Non-parametric Wind
#' Regression approach of Henry et al. (2009) that uses kernel smoothers. The
#' \code{openair} implementation is not identical because Gaussian kernels are
#' used for both wind direction and speed. The smoothing is controlled by
#' \code{ws_spread} and \code{wd_spread}.
#'
#' \item \code{statistic = "cpf"} the conditional probability function (CPF)
#' is plotted and a single (usually high) percentile level is supplied. The
#' CPF is defined as CPF = my/ny, where my is the number of samples in the y
#' bin (by default a wind direction, wind speed interval) with mixing ratios
#' greater than the \emph{overall} percentile concentration, and ny is the
#' total number of samples in the same wind sector (see Ashbaugh et al.,
#' 1985). Note that percentile intervals can also be considered; see
#' \code{percentile} for details.
#'
#' \item When \code{statistic = "r"} or \code{statistic = "Pearson"}, the
#' Pearson correlation coefficient is calculated for \emph{two} pollutants.
#' The calculation involves a weighted Pearson correlation coefficient, which
#' is weighted by Gaussian kernels for wind direction and the radial variable
#' (by default wind speed). More weight is assigned to values close to a wind
#' speed-direction interval. Kernel weighting is used to ensure that all data
#' are used rather than relying on the potentially small number of values in a
#' wind speed-direction interval.
#'
#' \item When \code{statistic = "Spearman"}, the Spearman correlation
#' coefficient is calculated for \emph{two} pollutants. The calculation
#' involves a weighted Spearman correlation coefficient, which is weighted by
#' Gaussian kernels for wind direction and the radial variable (by default wind
#' speed). More weight is assigned to values close to a wind speed-direction
#' interval. Kernel weighting is used to ensure that all data are used rather
#' than relying on the potentially small number of values in a wind
#' speed-direction interval.
#'
#' \item \code{"robust_slope"} is another option for pair-wise statistics and
#' \code{"quantile.slope"}, which uses quantile regression to estimate the
#' slope for a particular quantile level (see also \code{tau} for setting the
#' quantile level).
#'
#' \item \code{"york_slope"} is another option for pair-wise statistics which
#' uses the \emph{York regression method} to estimate the slope. In this
#' method the uncertainties in \code{x} and \code{y} are used in the
#' determination of the slope. The uncertainties are provided by
#' \code{x_error} and \code{y_error} --- see below.}
#'
#' @param limits The function does its best to choose sensible limits
#' automatically. However, there are circumstances when the user will wish to
#' set different ones. An example would be a series of plots showing each year
#' of data separately. The limits are set in the form \code{c(lower, upper)},
#' so \code{limits = c(0, 100)} would force the plot limits to span 0-100.
#'
#' @param exclude.missing Setting this option to \code{TRUE} (the default)
#' removes points from the plot that are too far from the original data. The
#' smoothing routines will produce predictions at points where no data exist
#' i.e. they predict. Removing the points that are too far from the original
#' data produces a plot where it is clear where the original data lie. If set to
#' \code{FALSE} missing data will be interpolated.
#'
#' @param uncertainty Should the uncertainty in the calculated surface be shown?
#' If \code{TRUE} three plots are produced on the same scale showing the
#' predicted surface together with the estimated lower and upper uncertainties
#' at the 95% confidence interval. Calculating the uncertainties is useful to
#' understand whether features are real or not. For example, at high wind
#' speeds where there are few data there is greater uncertainty over the
#' predicted values. The uncertainties are calculated using the GAM and
#' weighting is done by the frequency of measurements in each wind
#' speed-direction bin. Note that if uncertainties are calculated then the
#' type is set to "default".
#'
#' @param percentile If \code{statistic = "percentile"} then \code{percentile}
#' is used, expressed from 0 to 100. Note that the percentile value is
#' calculated in the wind speed, wind direction \sQuote{bins}. For this reason
#' it can also be useful to set \code{min.bin} to ensure there are a
#' sufficient number of points available to estimate a percentile. See
#' \code{quantile} for more details of how percentiles are calculated.
#'
#' \code{percentile} is also used for the Conditional Probability Function
#' (CPF) plots. \code{percentile} can be of length two, in which case the
#' percentile \emph{interval} is considered for use with CPF. For example,
#' \code{percentile = c(90, 100)} will plot the CPF for concentrations between
#' the 90 and 100th percentiles. Percentile intervals can be useful for
#' identifying specific sources. In addition, \code{percentile} can also be of
#' length 3. The third value is the \sQuote{trim} value to be applied. When
#' calculating percentile intervals many can cover very low values where there
#' is no useful information. The trim value ensures that values greater than
#' or equal to the trim * mean value are considered \emph{before} the
#' percentile intervals are calculated. The effect is to extract more detail
#' from many source signatures. See the manual for examples. Finally, if the
#' trim value is less than zero the percentile range is interpreted as
#' absolute concentration values and subsetting is carried out directly.
#'
#' @param cols Colours to be used for plotting. Options include
#' \dQuote{default}, \dQuote{increment}, \dQuote{heat}, \dQuote{jet} and
#' \code{RColorBrewer} colours --- see the \code{openair} \code{openColours}
#' function for more details. For user-defined colours, a vector of colour
#' names recognised by R can be supplied (type \code{colours()} to see the full
#' list).
#' An example would be \code{cols = c("yellow", "green", "blue")}. \code{cols}
#' can also take the values \code{"viridis"}, \code{"magma"},
#' \code{"inferno"}, or \code{"plasma"} which are the viridis colour maps
#' ported from Python's Matplotlib library.
#'
#' @param weights At the edges of the plot there may only be a few data points
#' in each wind speed-direction interval, which could in some situations
#' distort the plot if the concentrations are high. \code{weights} applies a
#' weighting to reduce their influence. For example, by default, if only a
#' single data point exists in a bin the weighting factor is 0.25, for two
#' points it is 0.5 and for three points 0.75. To not apply any weighting and
#' use the data as is, set \code{weights = c(1, 1, 1)}.
#'
#' An alternative to down-weighting these points is to remove them altogether
#' by using \code{min.bin}.
#'
#' @param min.bin The minimum number of points allowed in a wind speed/wind
#' direction bin. The default is 1. A value of two requires at least 2 valid
#' records in each bin and so on; bins with fewer than 2 valid records are set
#' to NA. Care should be taken when using a value > 1 because of the risk of
#' removing real data points. It is recommended to consider your data with
#' care. Also, the \code{polarFreq} function can be of use in such
#' circumstances.
#'
#' @param mis.col When \code{min.bin} is > 1 it can be useful to show where data
#' are removed on the plots. This is done by shading the missing data in
#' \code{mis.col}. To not highlight missing data when \code{min.bin} > 1
#' choose \code{mis.col = "transparent"}.
#'
#' @param alpha The alpha transparency to use for the plotting surface (a value
#' between 0 and 1 with zero being fully transparent and 1 fully opaque).
#' Setting a value below 1 can be useful when plotting surfaces on a map using
#' the package \code{openairmaps}.
#'
#' @param upper This sets the upper limit wind speed to be used. Often there are
#' only relatively few data points at very high wind speeds and plotting all
#' of them can reduce the useful information in the plot.
#'
#' @param angle.scale Sometimes the placement of the scale may interfere with an
#' interesting feature. The user can therefore set \code{angle.scale} to any
#' value between 0 and 360 degrees to mitigate such problems. For example
#' \code{angle.scale = 45} will draw the scale heading in a NE direction.
#'
#' @param units The units shown on the polar axis scale.
#'
#' @param force.positive The default is \code{TRUE}. Sometimes if smoothing data
#' with steep gradients it is possible for predicted values to be negative.
#' \code{force.positive = TRUE} ensures that predictions remain positive. This
#' is useful for several reasons. First, with lots of missing data more
#' interpolation is needed and this can result in artefacts because the
#' predictions are too far from the original data. Second, if it is known
#' beforehand that the data are all positive, then this option carries that
#' assumption through to the prediction. The only likely case where setting
#' \code{force.positive = FALSE} makes sense is if background concentrations were
#' first subtracted, resulting in data that are legitimately negative. For the
#' vast majority of situations it is expected that the user will not need to
#' alter the default option.
#'
#' @param k This is the smoothing parameter used by the \code{gam} function in
#' package \code{mgcv}. Typically, a value of around 100 (the default) seems to
#' be suitable and will resolve important features in the plot. The most
#' appropriate choice of \code{k} is problem-dependent; but extensive testing
#' of polar plots for many different problems suggests a value of \code{k} of
#' about 100 is suitable. Setting \code{k} to higher values will not tend to
#' affect the surface predictions by much but will add to the computation
#' time. Lower values of \code{k} will increase smoothing. Sometimes with few
#' data to plot \code{polarPlot} will fail. Under these circumstances it can
#' be worth lowering the value of \code{k}.
#'
#' @param normalise If \code{TRUE} concentrations are normalised by dividing by
#' their mean value. This is done \emph{after} fitting the smooth surface.
#' This option is particularly useful if one is interested in the patterns of
#' concentrations for several pollutants on different scales e.g. NOx and CO.
#' Often useful if more than one \code{pollutant} is chosen.
#'
#' @param key.header Adds additional text/labels to the scale key. For example,
#' passing the options \code{key.header = "header", key.footer = "footer1"}
#' adds additional text above and below the scale key. These arguments are
#' passed to \code{drawOpenKey} via \code{quickText}, applying the
#' \code{auto.text} argument, to handle formatting.
#'
#' @param key.footer see \code{key.header}.
#'
#' @param key.position Location where the scale key is to be plotted. Allowed
#' arguments currently include \code{"top"}, \code{"right"}, \code{"bottom"}
#' and \code{"left"}.
#'
#' @param key Fine control of the scale key via \code{drawOpenKey}. See
#' \code{drawOpenKey} for further details.
#'
#' @param auto.text Either \code{TRUE} (default) or \code{FALSE}. If \code{TRUE}
#' titles and axis labels will automatically try and format pollutant names
#' and units properly e.g. by subscripting the `2' in NO2.
#'
#' @param ws_spread The value of sigma used for Gaussian kernel weighting of
#' wind speed when \code{statistic = "nwr"} or when correlation and regression
#' statistics are used such as \emph{r}. Default is \code{1.5}.
#'
#' @param wd_spread The value of sigma used for Gaussian kernel weighting of
#' wind direction when \code{statistic = "nwr"} or when correlation and
#' regression statistics are used such as \emph{r}. Default is \code{5}.
#'
#' @param x_error The \code{x} error / uncertainty used when \code{statistic =
#' "york_slope"}.
#'
#' @param y_error The \code{y} error / uncertainty used when \code{statistic =
#' "york_slope"}.
#'
#' @param kernel Type of kernel used for the weighting procedure for when
#' correlation or regression techniques are used. Only \code{"gaussian"} is
#' supported but this may be enhanced in the future.
#'
#' @param formula.label When pair-wise statistics such as regression slopes are
#' calculated and plotted, should a formula label be displayed?
#'
#' @param tau The quantile to be estimated when \code{statistic} is set to
#' \code{"quantile.slope"}. Default is \code{0.5} which is equal to the median
#' and will be ignored if \code{"quantile.slope"} is not used.
#'
#' @param plot Should a plot be produced? \code{FALSE} can be useful when
#' analysing data to extract plot components and plotting them in other ways.
#'
#' @param ... Other graphical parameters passed onto \code{lattice::levelplot}
#' and \code{cutData}. For example, \code{polarPlot} passes the option
#' \code{hemisphere = "southern"} on to \code{cutData} to provide southern
#' (rather than default northern) hemisphere handling of \code{type =
#' "season"}. Similarly, common axis and title labelling options (such as
#' \code{xlab}, \code{ylab}, \code{main}) are passed to \code{levelplot} via
#' \code{quickText} to handle routine formatting.
#'
#' @import lattice
#' @import mgcv
#' @return an [openair][openair-package] object. `data` contains four
#' columns: \code{cond}, conditioning based on \code{type}; \code{u} and
#' \code{v}, the translational vectors based on \code{ws} and \code{wd}; and
#' the local \code{pollutant} estimate.
#' @author David Carslaw
#' @family polar directional analysis functions
#' @references
#'
#' Ashbaugh, L.L., Malm, W.C., Sadeh, W.Z., 1985. A residence time probability
#' analysis of sulfur concentrations at Grand Canyon National Park. Atmospheric
#' Environment 19 (8), 1263-1270.
#'
#' Carslaw, D.C., Beevers, S.D, Ropkins, K and M.C. Bell (2006). Detecting and
#' quantifying aircraft and other on-airport contributions to ambient nitrogen
#' oxides in the vicinity of a large international airport. Atmospheric
#' Environment. 40/28 pp 5424-5434.
#'
#' Carslaw, D.C., & Beevers, S.D. (2013). Characterising and understanding
#' emission sources using bivariate polar plots and k-means clustering.
#' Environmental Modelling & Software, 40, 325-329.
#' doi:10.1016/j.envsoft.2012.09.005
#'
#' Henry, R.C., Chang, Y.S., Spiegelman, C.H., 2002. Locating nearby sources of
#' air pollution by nonparametric regression of atmospheric concentrations on
#' wind direction. Atmospheric Environment 36 (13), 2237-2244.
#'
#' Henry, R., Norris, G.A., Vedantham, R., Turner, J.R., 2009. Source region
#' identification using Kernel smoothing. Environ. Sci. Technol. 43 (11),
#' 4090-4097. http://dx.doi.org/10.1021/es8011723.
#'
#' Uria-Tellaetxe, I. and D.C. Carslaw (2014). Source identification using a
#' conditional bivariate probability function. Environmental Modelling &
#' Software, Vol. 59, 1-9.
#'
#' Westmoreland, E.J., N. Carslaw, D.C. Carslaw, A. Gillah and E. Bates (2007).
#' Analysis of air quality within a street canyon using statistical and
#' dispersion modelling techniques. Atmospheric Environment. Vol. 41(39), pp.
#' 9195-9205.
#'
#' Yu, K.N., Cheung, Y.P., Cheung, T., Henry, R.C., 2004. Identifying the impact
#' of large urban airports on local air quality by nonparametric regression.
#' Atmospheric Environment 38 (27), 4501-4507.
#'
#' Grange, S. K., Carslaw, D. C., & Lewis, A. C. 2016. Source apportionment
#' advances with bivariate polar plots, correlation, and regression techniques.
#' Atmospheric Environment. 145, 128-134.
#' \url{https://www.sciencedirect.com/science/article/pii/S1352231016307166}
#' @examples
#'
#' # Use openair 'mydata'
#'
#' # basic plot
#' polarPlot(openair::mydata, pollutant = "nox")
#' \dontrun{
#'
#' # polarPlots by year on same scale
#' polarPlot(mydata, pollutant = "so2", type = "year", main = "polarPlot of so2")
#'
#' # set minimum number of bins to be used to see if pattern remains similar
#' polarPlot(mydata, pollutant = "nox", min.bin = 3)
#'
#' # plot by day of the week
#' polarPlot(mydata, pollutant = "pm10", type = "weekday")
#'
#' # show the 95% confidence intervals in the surface fitting
#' polarPlot(mydata, pollutant = "so2", uncertainty = TRUE)
#'
#'
#' # Pair-wise statistics
#' # Pearson correlation
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "r")
#'
#' # Robust regression slope, takes a bit of time
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "robust.slope")
#'
#' # Least squares regression works too but it is not recommended, use robust
#' # regression
#' # polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "slope")
#' }
#'
#' @export
polarPlot <-
function(mydata,
pollutant = "nox",
x = "ws",
wd = "wd",
type = "default",
statistic = "mean",
limits = NULL,
exclude.missing = TRUE,
uncertainty = FALSE,
percentile = NA,
cols = "default",
weights = c(0.25, 0.5, 0.75),
min.bin = 1,
mis.col = "grey",
upper = NA,
angle.scale = 315,
units = x,
force.positive = TRUE,
k = 100,
normalise = FALSE,
key.header = statistic,
key.footer = pollutant,
key.position = "right",
key = TRUE,
auto.text = TRUE,
ws_spread = 1.5,
wd_spread = 5,
x_error = NA,
y_error = NA,
kernel = "gaussian",
formula.label = TRUE,
tau = 0.5,
alpha = 1,
plot = TRUE,
...) {
## get rid of R check annoyances
z <- . <- NULL
if (statistic == "percentile" & is.na(percentile[1] & statistic != "cpf")) {
warning("percentile value missing, using 50")
percentile <- 50
}
if (statistic == "cpf" & is.na(percentile[1])) {
warning("percentile value missing, using 75")
percentile <- 75
}
# Allow for period notation
statistic <- gsub("\\.| ", "_", statistic)
# Build vector for many checks
correlation_stats <- c(
"r", "slope", "intercept", "robust_slope",
"robust_intercept", "quantile_slope",
"quantile_intercept", "Pearson", "Spearman", "york_slope", "trend"
)
if (statistic %in% correlation_stats && length(pollutant) != 2) {
stop("Correlation statistic requires two pollutants.")
}
# if statistic is trend, then don't force to be positive
if (statistic == "trend") {
force.positive <- FALSE
}
# names of variables for use later
nam.x <- x
nam.wd <- wd
if (length(type) > 2) stop("Maximum number of types is 2.")
## can't have conditioning here
if (uncertainty) type <- "default"
if (uncertainty & length(pollutant) > 1) {
stop("Can only have one pollutant when uncertainty = TRUE")
}
if (!statistic %in% c(
"mean", "median", "frequency", "max", "stdev", "trend",
"weighted_mean", "percentile", "cpf", "nwr", correlation_stats
)) {
stop(paste0("statistic '", statistic, "' not recognised."), call. = FALSE)
}
if (length(weights) != 3) stop("weights should be of length 3.")
if (key.header == "nwr") key.header <- "NWR"
if (key.header == "weighted_mean") key.header <- "weighted\nmean"
if (key.header == "percentile") {
key.header <- c(paste(percentile, "th", sep = ""), "percentile")
}
if ("cpf" %in% key.header) key.header <- c("CPF", "probability")
## greyscale handling
if (length(cols) == 1 && cols == "greyscale") {
trellis.par.set(list(strip.background = list(col = "white")))
}
## set graphics
current.strip <- trellis.par.get("strip.background")
current.font <- trellis.par.get("fontsize")
## reset graphic parameters
on.exit(trellis.par.set(
fontsize = current.font
))
## extra.args setup
extra.args <- list(...)
## label controls
extra.args$xlab <- if ("xlab" %in% names(extra.args)) {
quickText(extra.args$xlab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$ylab <- if ("ylab" %in% names(extra.args)) {
quickText(extra.args$ylab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$main <- if ("main" %in% names(extra.args)) {
quickText(extra.args$main, auto.text)
} else {
quickText("", auto.text)
}
if ("fontsize" %in% names(extra.args)) {
trellis.par.set(fontsize = list(text = extra.args$fontsize))
}
# if clustering, return lower resolution plot
extra.args$cluster <- if ("cluster" %in% names(extra.args)) {
TRUE
} else {
FALSE
}
## layout default
if (!"layout" %in% names(extra.args)) {
extra.args$layout <- NULL
}
## extract variables of interest
vars <- c(wd, x, pollutant)
if (statistic == "york_slope") {
vars <- c(vars, x_error, y_error)
}
if (any(type %in% dateTypes)) vars <- c(vars, "date")
mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
mydata <- na.omit(mydata)
## this is used later for the 'x' scale
min.scale <- min(mydata[[x]], na.rm = TRUE)
# check to see if a high proportion of the data is negative
if (length(which(mydata[pollutant] < 0)) / nrow(mydata) > 0.1 && force.positive) {
warning(">10% negative data detected, set force.positive = FALSE?")
}
# scale data by subtracting the min value this helps with dealing
# with negative data on radial axis (starts from zero, always
# positive)
# return radial scale as attribute "radial_scale" on returned data for external plotting
radial_scale <- range(mydata[[x]])
mydata[[x]] <- mydata[[x]] - min(mydata[[x]], na.rm = TRUE)
# if more than one pollutant, need to stack the data and set type =
# "variable". This case is most relevant for model-measurement
# comparisons where data are in columns. Can also do more than one
# pollutant and a single type that is not "default", in which case
# pollutant becomes a conditioning variable
# if statistic is 'r' we need the data for two pollutants in columns
if (length(pollutant) > 1 && !statistic %in% correlation_stats) {
if (length(type) > 1) {
warning(paste("Only type = '", type[1], "' will be used", sep = ""))
type <- type[1]
}
## use pollutants as conditioning variables
mydata <- gather(mydata,
key = variable, value = value, pollutant,
factor_key = TRUE
)
## now set pollutant to "value"
pollutant <- "value"
if (type == "default") {
type <- "variable"
} else {
type <- c(type, "variable")
}
}
## cutData depending on type
mydata <- cutData(mydata, type, ...)
## if upper ws not set, set it to the max to display all information
max.ws <- max(mydata[[x]], na.rm = TRUE)
min.ws <- min(mydata[[x]], na.rm = TRUE)
clip <- TRUE ## used for removing data where ws > upper
if (is.na(upper)) {
upper <- max.ws
clip <- FALSE
}
# resolution deprecated, int is resolution of GAM surface predictions over int * int grid
# 51 works well with bilinear interpolation of results
int <- 51
## binning wd data properly
## use 10 degree binning of wd if already binned, else 5
if (all(mydata[[wd]] %% 10 == 0, na.rm = TRUE)) {
wd.int <- 10
} else {
if (toupper(statistic) == "NWR") wd.int <- 2 else wd.int <- 5 ## how to split wd
}
if (toupper(statistic) == "NWR") ws_bins <- 40 else ws_bins <- 30
if (statistic == "nwr") k <- 200 # limit any smoothing
ws.seq <- seq(min.ws, max.ws, length = ws_bins)
wd.seq <- seq(from = wd.int, to = 360, by = wd.int) ## wind directions from wd.int to 360
ws.wd <- expand.grid(x = ws.seq, wd = wd.seq)
u <- with(ws.wd, x * sin(pi * wd / 180)) ## convert to polar coords
v <- with(ws.wd, x * cos(pi * wd / 180))
## data to predict over
input.data <- expand.grid(
u = seq(-upper, upper, length = int),
v = seq(-upper, upper, length = int)
)
if (statistic == "cpf") {
## can be interval of percentiles or a single (threshold)
if (length(percentile) > 1) {
statistic <- "cpfi" # CPF interval
if (length(percentile) == 3) {
## in this case there is a trim value as a proportion of the mean
## if this value <0 use absolute values as range
Mean <- mean(mydata[[pollutant]], na.rm = TRUE)
if (percentile[3] < 0) {
Pval <- percentile[1:2]
} else {
Pval <- quantile(
subset(
mydata[[pollutant]],
mydata[[pollutant]] >= Mean *
percentile[3]
),
probs = percentile[1:2] / 100,
na.rm = TRUE
)
}
} else {
Pval <- quantile(
mydata[[pollutant]],
probs = percentile / 100, na.rm = TRUE
)
}
sub <- paste(
"CPF (", format(Pval[1], digits = 2), " to ",
format(Pval[2], digits = 2), ")",
sep = ""
)
} else {
Pval <- quantile(mydata[[pollutant]], probs = percentile / 100, na.rm = TRUE)
sub <- paste(
"CPF at the ", percentile,
"th percentile (=", format(Pval, digits = 2), ")",
sep = ""
)
}
} else {
sub <- NULL
}
prepare.grid <- function(mydata) {
## identify which ws and wd bins the data belong
wd <- cut(
wd.int * ceiling(mydata[[wd]] / wd.int - 0.5),
breaks = seq(0, 360, wd.int), include.lowest = TRUE
)
x <- cut(
mydata[[x]],
breaks = seq(0, max.ws, length = ws_bins + 1),
include.lowest = TRUE
)
if (!statistic %in% c(correlation_stats, "nwr", "trend")) {
binned <- switch(statistic,
frequency = tapply(mydata[[pollutant]], list(wd, x), function(x) {
length(na.omit(x))
}),
mean = tapply(mydata[[pollutant]], list(wd, x), function(x) {
mean(x, na.rm = TRUE)
}),
median = tapply(mydata[[pollutant]], list(wd, x), function(x) {
median(x, na.rm = TRUE)
}),
max = tapply(mydata[[pollutant]], list(wd, x), function(x) {
max(x, na.rm = TRUE)
}),
stdev = tapply(mydata[[pollutant]], list(wd, x), function(x) {
sd(x, na.rm = TRUE)
}),
cpf = tapply(
mydata[[pollutant]], list(wd, x),
function(x) {
(length(which(
x > Pval
)) / length(x))
}
),
cpfi = tapply(
mydata[[pollutant]], list(wd, x),
function(x) {
(length(
which(x > Pval[1] & x <= Pval[2])
) / length(x))
}
),
weighted_mean = tapply(
mydata[[pollutant]], list(wd, x),
function(x) {
(mean(x) * length(x) / nrow(mydata))
}
),
percentile = tapply(mydata[[pollutant]], list(wd, x), function(x) {
quantile(x, probs = percentile / 100, na.rm = TRUE)
})
)
binned <- as.vector(t(binned))
} else if (toupper(statistic) == "NWR") {
binned <- rowwise(ws.wd) %>%
summarise(simple_kernel(
across(.cols = everything()),
mydata,
x = nam.x, y = nam.wd, pollutant = pollutant,
ws_spread = ws_spread, wd_spread = wd_spread, kernel
))
binned <- binned$conc
} else if (toupper(statistic) == "TREND") {
binned <- rowwise(ws.wd) %>%
summarise(simple_kernel_trend(
across(.cols = everything()),
mydata,
x = nam.x, y = nam.wd, pollutant = pollutant, "date",
ws_spread = ws_spread, wd_spread = wd_spread, kernel,
tau = tau
))
binned <- binned$conc
} else {
binned <- rowwise(ws.wd) %>%
summarise(calculate_weighted_statistics(
across(.cols = everything()),
mydata,
statistic = statistic,
x = nam.x, y = nam.wd, pol_1 = pollutant[1], pol_2 = pollutant[2],
ws_spread = ws_spread, wd_spread = wd_spread, kernel, tau = tau,
x_error = x_error, y_error = y_error
))
# Get vector
binned <- binned$stat_weighted
# A catch for surface plotting
binned <- ifelse(binned == Inf, NA, binned)
}
# Building the plot begins...
## frequency - remove points with freq < min.bin
bin.len <- tapply(mydata[[pollutant[1]]], list(x, wd), length)
binned.len <- as.vector(bin.len)
## apply weights
W <- rep(1, times = length(binned))
ids <- which(binned.len == 1)
W[ids] <- W[ids] * weights[1]
ids <- which(binned.len == 2)
W[ids] <- W[ids] * weights[2]
ids <- which(binned.len == 3)
W[ids] <- W[ids] * weights[3]
## set missing to NA
ids <- which(binned.len < min.bin)
binned[ids] <- NA
# for removing missing data later
binned.len[ids] <- NA
if (force.positive) n <- 0.5 else n <- 1
## no uncertainty to calculate
if (!uncertainty) {
## catch errors when not enough data to calculate surface
Mgam <- try(gam(binned^n ~ s(u, v, k = k), weights = W), TRUE)
if (!inherits(Mgam, "try-error")) {
pred <- predict.gam(Mgam, input.data)
pred <- pred^(1 / n)
pred <- as.vector(pred)
# interpolate results for speed, but not for clustering
if (extra.args$cluster) {
results <- interp_grid(input.data, z = pred, n = 101)
int <- 101
} else {
results <- interp_grid(input.data, z = pred, n = 201)
int <- 201
}
} else {
results <- data.frame(u = u, v = v, z = binned)
exclude.missing <- FALSE
warning(call. = FALSE, paste("Not enough data to fit surface.\nTry reducing the value of the smoothing parameter, k to less than ", k, ". \nOr use statistic = 'nwr'.", sep = ""))
}
} else {
## uncertainties calculated, weighted by number of points in each bin
Mgam <- gam(binned^n ~ s(u, v, k = k), weights = binned.len)
pred <- predict.gam(Mgam, input.data, se.fit = TRUE)
uncer <- 2 * as.vector(pred[[2]]) ## for approx 95% CI
pred <- as.vector(pred[[1]])^(1 / n)
## do not weight for central prediction
Mgam <- gam(binned^n ~ s(u, v, k = k))
pred <- predict.gam(Mgam, input.data)
pred <- as.vector(pred)
Lower <- (pred - uncer)^(1 / n)
Upper <- (pred + uncer)^(1 / n)
pred <- pred^(1 / n)
n <- length(pred)
# interpolate each uncertainty surface
lower_uncer <- interp_grid(input.data, z = Lower, n = 201) %>% mutate(uncertainty = "lower uncertainty")
upper_uncer <- interp_grid(input.data, z = Upper, n = 201) %>% mutate(uncertainty = "upper uncertainty")
prediction <- interp_grid(input.data, z = pred, n = 201) %>% mutate(uncertainty = "prediction")
results <- bind_rows(prediction, lower_uncer, upper_uncer)
int <- 201
}
## function to remove points too far from original data
exclude <- function(results) {
## exclude predictions too far from data (from mgcv)
x <- seq(-upper, upper, length = int)
y <- x
res <- int
wsp <- rep(x, res)
wdp <- rep(y, rep(res, res))
## data with gaps caused by min.bin
all.data <- na.omit(data.frame(u, v, binned.len))
ind <- with(all.data, exclude.too.far(wsp, wdp, u, v, dist = 0.05))
results$z[ind] <- NA
results
}
if (exclude.missing) results <- exclude(results)
results
}
## if min.bin >1 show the missing data. Work this out by running twice:
## first time with no missings, second with min.bin.
## Plot one surface on top of the other.
if (!missing(min.bin)) {
tmp <- min.bin
min.bin <- 0
res1 <- mydata %>%
group_by(across(type)) %>%
group_modify(~ prepare.grid(.))
min.bin <- tmp
res <- mydata %>%
group_by(across(type)) %>%
group_modify(~ prepare.grid(.))
res$miss <- res1$z
} else {
res <- mydata %>%
group_by(across(type)) %>%
group_modify(~ prepare.grid(.))
}
## with CPF make sure not >1 due to surface fitting
if (any(res$z > 1, na.rm = TRUE) & statistic %in% c("cpf", "cpfi")) {
id <- which(res$z > 1)
res$z[id] <- 1
}
## remove wind speeds > upper to make a circle
if (clip) res$z[(res$u^2 + res$v^2)^0.5 > upper] <- NA
## proper names of labelling
strip.dat <- strip.fun(res, type, auto.text)
strip <- strip.dat[[1]]
strip.left <- strip.dat[[2]]
pol.name <- strip.dat[[3]]
if (uncertainty) strip <- TRUE
## normalise by dividing by the mean conditioning value if needed
if (normalise) {
res <- mutate(res, z = z / mean(z, na.rm = TRUE))
if (missing(key.footer)) key.footer <- "normalised\nlevel"
}
# correlation notation
if (statistic %in% c("r", "Pearson", "Spearman")) {
if (missing(key.footer)) {
key.footer <- paste0("corr(", pollutant[1], ", ", pollutant[2], ")")
}
# make sure smoothing does not result in r > 1 or < -1
# sometimes happens with little data at edges
id <- which(res$z > 1)
if (length(id) > 0) res$z[id] <- 1
id <- which(res$z < -1)
if (length(id) > 0) res$z[id] <- -1
}
# if regression, set key.footer to 'm' (slope)
if (grepl("slope|intercept", statistic) & length(pollutant == 2)) {
key.footer <- "m"
}
# Labels for correlation and regression, keep lower case like other labels
if (statistic %in% c("r", "Pearson")) key.header <- expression(italic("Pearson\ncorrelation"))
if (statistic == "Spearman") key.header <- expression(italic("Spearman\ncorrelation"))
if (statistic == "robust_slope") key.header <- "robust\nslope"
if (statistic == "robust_intercept") key.header <- "robust\nintercept"
if (statistic == "york_slope") key.header <- "York regression\nslope"
if (statistic == "quantile_slope") {
key.header <- paste0("quantile slope\n(tau: ", tau, ")")
}
if (statistic == "quantile_intercept") {
key.header <- paste0("quantile intercept\n(tau: ", tau, ")")
}
## auto-scaling
nlev <- 200 ## preferred number of intervals
## handle missing breaks arguments
if (any(is.null(limits)) | any(is.na(limits))) {
# breaks <- pretty(res$z, n = nlev)
breaks <- seq(
min(res$z, na.rm = TRUE), max(res$z, na.rm = TRUE),
length.out = nlev
)
labs <- pretty(breaks, 7)
labs <- labs[labs >= min(breaks) & labs <= max(breaks)]
at <- labs
} else {
## handle user limits and clipping
breaks <- seq(min(limits), max(limits), length.out = nlev)
labs <- pretty(breaks, 7)
labs <- labs[labs >= min(breaks) & labs <= max(breaks)]
at <- labs
## case where user max is < data max
if (max(limits) < max(res[["z"]], na.rm = TRUE)) {
id <- which(res[["z"]] > max(limits))
res[["z"]][id] <- max(limits)
labs[length(labs)] <- paste(">", labs[length(labs)])
}
## case where user min is > data min
if (min(limits) > min(res[["z"]], na.rm = TRUE)) {
id <- which(res[["z"]] < min(limits))
res[["z"]][id] <- min(limits)
labs[1] <- paste("<", labs[1])
}
}
nlev2 <- length(breaks)
col <- openColours(cols, (nlev2 - 1))
col.scale <- breaks
## special handling of layout for uncertainty
if (uncertainty & is.null(extra.args$layout)) extra.args$layout <- c(3, 1)
legend <- list(
col = col, at = col.scale,
labels = list(labels = labs, at = at),
space = key.position, auto.text = auto.text,
footer = key.footer, header = key.header,
height = 1, width = 1.5, fit = "all"
)
legend <- makeOpenKeyLegend(key, legend, "polarPlot")
## scaling
## scaling of 'zeroed' data
## note - add upper because user can set this to be different to data
intervals <- pretty(c(mydata[[x]], upper))
## labels for scaling
labels <- pretty(c(mydata[[x]], upper) + min.scale)
## offset the lines/labels if necessary
intervals <- intervals + (min(labels) - min.scale)
## add zero in the middle if it exists
if (min.scale != 0) {
labels <- labels[-1]
intervals <- intervals[-1]
}
# if uncertainty = TRUE, change type for plotting (3 panels)
if (uncertainty) {
type <- "uncertainty"
}
temp <- paste(type, collapse = "+")
myform <- formula(paste("z ~ u * v | ", temp, sep = ""))
Args <- list(
x = myform, res, axes = FALSE,
as.table = TRUE,
strip = strip,
strip.left = strip.left,
col.regions = col,
region = TRUE,
aspect = 1,
sub = sub,
par.strip.text = list(cex = 0.8),
scales = list(draw = FALSE),
xlim = c(-upper * 1.025, upper * 1.025),
ylim = c(-upper * 1.025, upper * 1.025),
colorkey = FALSE,
legend = legend,
# add footnote formula if regression used
page = function(n) {
if (formula.label && grepl("slope|intercept", statistic) &&
length(pollutant) == 2) {
grid.text(
quickText(paste0(
"Formula: ", pollutant[1], " = m.", pollutant[2], " + c"
)),
x = .99,
y = 0.01,
default.units = "npc",
gp = gpar(fontsize = 10),
just = c("right", "bottom")
)
}
},
panel = function(x, y, z, subscripts, ...) {
## show missing data due to min.bin
if (min.bin > 1) {
panel.levelplot(
x, y, res$miss,
subscripts,
col.regions = mis.col,
labels = FALSE
)
}
panel.levelplot(
x, y, z,
subscripts,
at = col.scale,
pretty = TRUE,
col.regions = col,
labels = FALSE,
alpha.regions = alpha
)
angles <- seq(0, 2 * pi, length = 360)
sapply(intervals, function(x) {
llines(
x * sin(angles), x * cos(angles),
col = "grey", lty = 5
)
})
ltext(
1.07 * intervals * sin(pi * angle.scale / 180),
1.07 * intervals * cos(pi * angle.scale / 180),
sapply(
paste(labels, c("", "", units, rep("", 7))),
function(x) {
quickText(x, auto.text)
}
),
cex = 0.7, pos = 4
)
## add axis line to central polarPlot
lsegments(-upper, 0, upper, 0)
lsegments(0, -upper, 0, upper)
ltext(upper * -1 * 0.95, 0.07 * upper, "W", cex = 0.7)
ltext(0.07 * upper, upper * -1 * 0.95, "S", cex = 0.7)
ltext(0.07 * upper, upper * 0.95, "N", cex = 0.7)
ltext(upper * 0.95, 0.07 * upper, "E", cex = 0.7)
}
)
## reset for extra.args
Args <- listUpdate(Args, extra.args)
plt <- do.call(levelplot, Args)
if (plot) {
if (length(type) == 1) {
plot(plt)
} else {
plot(useOuterStrips(plt,
strip = strip,
strip.left = strip.left
))
}
}
newdata <- res
# return attribute of scale used - useful for data with negative scales such as air_temp
attr(newdata, "radial_scale") <- range(radial_scale)
output <- list(plot = plt, data = newdata, call = match.call())
class(output) <- "openair"
# Final return
invisible(output)
}
# Gaussian bivariate density function
gauss_dens <- function(x, y, mx, my, sx, sy) {
(1 / (2 * pi * sx * sy)) *
exp((-1 / 2) * ((x - mx)^2 / sx^2 + (y - my)^2 / sy^2))
}
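# Example (not run): equivalent to the product of two independent normal
# densities, e.g. for a point 1 m/s and 10 degrees from a bin centre using
# the default polarPlot spreads (ws_spread = 1.5, wd_spread = 5):
# gauss_dens(1, 10, 0, 0, 1.5, 5)
# dnorm(1, sd = 1.5) * dnorm(10, sd = 5) # same value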
# NWR kernel calculations
simple_kernel <- function(data, mydata, x = "ws",
y = "wd", pollutant,
ws_spread, wd_spread, kernel) {
# Centres
ws1 <- data[[1]]
wd1 <- data[[2]]
# centred ws, wd
ws_cent <- mydata[[x]] - ws1
wd_cent <- mydata[[y]] - wd1
wd_cent <- ifelse(wd_cent < -180, wd_cent + 360, wd_cent)
weight <- gauss_dens(ws_cent, wd_cent, 0, 0, ws_spread, wd_spread)
conc <- sum(mydata[[pollutant]] * weight) /
sum(weight)
return(data.frame(conc = conc))
}
# function to do kernel weighting of a quantile regression trend
simple_kernel_trend <- function(data, mydata, x = "ws",
y = "wd", pollutant, date,
ws_spread, wd_spread, kernel, tau) {
# Centres
ws1 <- data[[1]]
wd1 <- data[[2]]
# Gaussian bivariate density function
gauss_dens <- function(x, y, mx, my, sx, sy) {
(1 / (2 * pi * sx * sy)) *
exp((-1 / 2) * ((x - mx)^2 / sx^2 + (y - my)^2 / sy^2))
}
# centred ws, wd
ws_cent <- mydata[[x]] - ws1
wd_cent <- mydata[[y]] - wd1
wd_cent <- ifelse(wd_cent < -180, wd_cent + 360, wd_cent)
weight <- gauss_dens(ws_cent, wd_cent, 0, 0, ws_spread, wd_spread)
weight <- weight / max(weight)
mydata$weight <- weight
# quantreg is a Suggests package, so make sure it is there
try_require("quantreg", "polarPlot")
# don't fit all data - takes too long with no gain
mydata <- filter(mydata, weight > 0.00001)
# Drop dplyr's data frame for formula
mydata <- data.frame(mydata)
# Build model
suppressWarnings(
fit <- try(quantreg::rq(
mydata[[pollutant[1]]] ~ mydata[[date]],
tau = tau,
weights = mydata[["weight"]], method = "fn"
), TRUE)
)
# Extract statistics
if (!inherits(fit, "try-error")) {
# Extract statistics
slope <- 365.25 * 24 * 3600 * fit$coefficients[2]
} else {
slope <- NA
}
return(data.frame(conc = slope))
}
# No export
calculate_weighted_statistics <-
function(data, mydata, statistic, x = "ws",
y = "wd", pol_1, pol_2,
ws_spread, wd_spread, kernel, tau,
x_error, y_error) {
weight <- NULL
# Centres
ws1 <- data[[1]]
wd1 <- data[[2]]
# centred ws, wd
ws_cent <- mydata[[x]] - ws1
wd_cent <- mydata[[y]] - wd1
wd_cent <- ifelse(wd_cent < -180, wd_cent + 360, wd_cent)
weight <- gauss_dens(ws_cent, wd_cent, 0, 0, ws_spread, wd_spread)
weight <- weight / max(weight)
mydata$weight <- weight
# Select and filter
vars <- c(pol_1, pol_2, "weight")
if (!all(is.na(c(x_error, y_error)))) {
vars <- c(vars, x_error, y_error)
}
thedata <- mydata[vars]
# thedata <- thedata[complete.cases(thedata), ]
# don't fit all data - takes too long with no gain
thedata <- thedata[which(thedata$weight > 0.001), ]
# don't try and calculate stats if there's not much data
if (nrow(thedata) < 100) {
return(data.frame(ws1, wd1, stat_weighted = NA))
}
# useful for showing what the weighting looks like as a surface
# openair::scatterPlot(mydata, x = "ws", y = "wd", z = "weight", method = "level")
if (statistic %in% c("r", "Pearson", "Spearman")) {
if (statistic == "r") statistic <- "Pearson"
# Weighted Pearson correlation
stat_weighted <- contCorr(thedata[[pol_1]], thedata[[pol_2]],
w = thedata$weight, method = statistic
)
result <- data.frame(ws1, wd1, stat_weighted)
}
# Simple least squares regression with weights
if (statistic %in% c("slope", "intercept")) {
# Drop dplyr's data frame for formula
thedata <- data.frame(thedata)
# Calculate model, no warnings on perfect fits.
fit <- lm(thedata[, pol_1] ~ thedata[, pol_2], weights = thedata[, "weight"])
# Extract statistics
if (statistic == "slope") stat_weighted <- fit$coefficients[2]
if (statistic == "intercept") stat_weighted <- fit$coefficients[1]
# Bind together
result <- data.frame(ws1, wd1, stat_weighted)
}
if (statistic == "york_slope") {
thedata <- as.data.frame(thedata) # so formula works
result <- try(YorkFit(thedata,
X = names(thedata)[2],
Y = names(thedata)[1],
Xstd = x_error, Ystd = y_error,
weight = thedata$weight
), TRUE)
# Extract statistics
if (!inherits(result, "try-error")) {
# Extract statistics
stat_weighted <- result$Slope
} else {
stat_weighted <- NA
}
result <- data.frame(ws1, wd1, stat_weighted)
}
# Robust linear regression with weights
if (grepl("robust", statistic, ignore.case = TRUE)) {
# Drop dplyr's data frame for formula
thedata <- data.frame(thedata)
# Build model, optimal method (MM) cannot use weights
fit <- suppressWarnings(
try(MASS::rlm(
thedata[, pol_1] ~ thedata[, pol_2],
weights = thedata[, "weight"], method = "M"
), TRUE)
)
# Extract statistics
if (!inherits(fit, "try-error")) {
# Extract statistics
if (statistic == "robust_slope") stat_weighted <- fit$coefficients[2]
if (statistic == "robust_intercept") stat_weighted <- fit$coefficients[1]
} else {
if (statistic == "robust_slope") stat_weighted <- NA
if (statistic == "robust_intercept") stat_weighted <- NA
}
# Bind together
result <- data.frame(ws1, wd1, stat_weighted)
}
# Quantile regression with weights
if (grepl("quantile", statistic, ignore.case = TRUE)) {
# quantreg is a Suggests package, so make sure it is there
try_require("quantreg", "polarPlot")
# Drop dplyr's data frame for formula
thedata <- data.frame(thedata)
# Build model
suppressWarnings(
fit <- try(quantreg::rq(
thedata[[pol_1]] ~ thedata[[pol_2]],
tau = tau,
weights = thedata[["weight"]], method = "br"
), TRUE)
)
if (!inherits(fit, "try-error")) {
# Extract statistics
if (statistic == "quantile_slope") stat_weighted <- fit$coefficients[2]
if (statistic == "quantile_intercept") stat_weighted <- fit$coefficients[1]
} else {
if (statistic == "quantile_slope") stat_weighted <- NA
if (statistic == "quantile_intercept") stat_weighted <- NA
}
# Bind together
result <- data.frame(ws1, wd1, stat_weighted)
}
# Return
result
}
# Taken directly from the boot package to save importing
corr <- function(d, w = rep(1, nrow(d)) / nrow(d)) {
s <- sum(w)
m1 <- sum(d[, 1L] * w) / s
m2 <- sum(d[, 2L] * w) / s
(sum(d[, 1L] * d[, 2L] * w) / s - m1 * m2) /
sqrt((sum(d[, 1L]^2 * w) / s - m1^2) * (sum(d[, 2L]^2 * w) / s - m2^2))
}
# From enlightenr package
kernel_smoother <- function(x, kernel = "gaussian") {
kernel <- tolower(kernel)
if (kernel %in% c("gaussian", "normal")) {
x <- (2 * pi)^-0.5 * exp(-0.5 * x^2)
}
if (kernel == "epanechnikov") {
x <- 3 / 4 * (1 - x^2) * indicator_function(x)
}
if (kernel == "logistic") {
x <- 1 / (exp(x) + 2 + exp(-x))
}
if (kernel == "cosine") {
x <- pi / 4 * cos((pi / 2) * x) * indicator_function(x)
}
if (kernel == "triangular") {
x <- (1 - abs(x)) * indicator_function(x)
}
if (kernel %in% c("box", "uniform")) {
x <- 1 / 2 * indicator_function(x)
}
if (kernel == "tricube") {
x <- 70 / 81 * (1 - abs(x)^3)^3 * indicator_function(x)
}
if (kernel == "triweight") {
x <- 35 / 32 * (1 - x^2)^3 * indicator_function(x)
}
if (kernel %in% c("biweight", "quartic")) {
x <- 15 / 16 * (1 - x^2)^2 * indicator_function(x)
}
x
}
indicator_function <- function(x) ifelse(abs(x) <= 1, 1, 0)
# weighted Pearson and Spearman correlations, based on wCorr package
contCorr <- function(x, y, w, method = c("Pearson", "Spearman")) {
if (!is.numeric(x)) {
x <- as.numeric(x)
}
if (!is.numeric(y)) {
y <- as.numeric(y)
}
if (!is.numeric(w)) {
w <- as.numeric(w)
}
pm <- pmatch(tolower(method[[1]]), tolower(c("Pearson", "Spearman")))
if (pm == 2) {
# x <- rank(x) # rank gives averages for ties
# y <- rank(y)
x <- wrank(x, w)
y <- wrank(y, w)
}
xb <- sum(w * x) / sum(w)
yb <- sum(w * y) / sum(w)
numerator <- sum(w * (x - xb) * (y - yb))
denom <- sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
numerator / denom
}
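# Example (not run): with equal weights contCorr reduces to the ordinary
# correlation coefficient, e.g.
# x <- rnorm(100); y <- x + rnorm(100)
# contCorr(x, y, w = rep(1, 100), method = "Pearson") # equals cor(x, y)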
wrank <- function(x, w = rep(1, length(x))) {
# sort by x so we can just traverse once
ord <- order(x)
rord <- (1:length(x))[order(ord)] # reverse order
xp <- x[ord] # x, permuted
wp <- w[ord] # weights, permuted
rnk <- rep(NA, length(x)) # blank ranks vector
# set up first iteration
t1 <- 0 # total weight of lower ranked elements
i <- 1 # index
t2 <- 0 # total weight of tied elements (including self)
n <- 0 # number of tied elements
while (i < length(x)) {
t2 <- t2 + wp[i] # tied weight increases by this unit
n <- n + 1 # one additional tied unit
if (xp[i + 1] != xp[i]) { # the next one is not a tie
# find the rank of all tied elements
rnki <- t1 + (n + 1) / (2 * n) * t2
# push that rank to all tied units
for (ii in 1:n) {
rnk[i - ii + 1] <- rnki
}
# reset for next iteration
t1 <- t1 + t2 # new total weight for lower values
t2 <- 0 # new tied weight starts at 0
n <- 0 # no tied units
}
i <- i + 1
}
# final row
n <- n + 1 # add final tie
t2 <- t2 + wp[i] # add final weight to tied weight
rnki <- t1 + (n + 1) / (2 * n) * t2 # final rank
# push that rank to all final tied units
for (ii in 1:n) {
rnk[i - ii + 1] <- rnki
}
# order by incoming index, so put in the original order
rnk[rord]
}
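# Example (not run): with equal weights wrank matches base R's average ranks
# wrank(c(1, 2, 2, 3)) # 1.0 2.5 2.5 4.0, the same as rank(c(1, 2, 2, 3))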
YorkFit <- function(input_data, X = "X", Y = "Y",
Xstd = "Xstd", Ystd = "Ystd",
weight = NA,
Ri = 0, eps = 1e-7) {
# weight can be supplied to apply to errors
tol <- 1e-7 # need to refine
# b0 initial guess at slope for OLR
form <- formula(paste(Y, "~", X))
mod <- lm(form, data = input_data)
b0 <- mod$coefficients[2]
X <- input_data[[X]]
Y <- input_data[[Y]]
Xstd <- input_data[[Xstd]]
Ystd <- input_data[[Ystd]]
# don't try regression if < 3 points
if (sum(!is.na(X)) < 3 || sum(!is.na(Y)) < 3) {
return()
}
# used in polar plots - Gaussian kernel weighting
if (!all(is.na(weight))) {
Xstd <- Xstd / weight
Ystd <- Ystd / weight
}
Xw <- 1 / (Xstd^2) # X weights
Yw <- 1 / (Ystd^2) # Y weights
# ITERATIVE CALCULATION OF SLOPE AND INTERCEPT #
b <- b0
b.diff <- tol + 1
n <- 0 # counter for debugging
while (b.diff > tol && n < 100) {
n <- n + 1 # counter to keep a check on convergence
b.old <- b
alpha.i <- sqrt(Xw * Yw)
Wi <- (Xw * Yw) / ((b^2) * Yw + Xw - 2 * b * Ri * alpha.i)
WiX <- Wi * X
WiY <- Wi * Y
sumWiX <- sum(WiX, na.rm = TRUE)
sumWiY <- sum(WiY, na.rm = TRUE)
sumWi <- sum(Wi, na.rm = TRUE)
Xbar <- sumWiX / sumWi
Ybar <- sumWiY / sumWi
Ui <- X - Xbar
Vi <- Y - Ybar
Bi <- Wi * ((Ui / Yw) + (b * Vi / Xw) - (b * Ui + Vi) * Ri / alpha.i)
wTOPint <- Bi * Wi * Vi
wBOTint <- Bi * Wi * Ui
sumTOP <- sum(wTOPint, na.rm = TRUE)
sumBOT <- sum(wBOTint, na.rm = TRUE)
b <- sumTOP / sumBOT
# zero or problematic data
if (anyNA(c(b, b.old))) {
return(tibble(
Intercept = NA, Slope = NA,
Int_error = NA, Slope_error = NA,
OLS_slope = NA
))
}
b.diff <- abs(b - b.old)
}
a <- Ybar - b * Xbar
wYorkFitCoefs <- c(a, b)
# ERROR CALCULATION #
Xadj <- Xbar + Bi
WiXadj <- Wi * Xadj
sumWiXadj <- sum(WiXadj, na.rm = TRUE)
Xadjbar <- sumWiXadj / sumWi
Uadj <- Xadj - Xadjbar
wErrorTerm <- Wi * Uadj * Uadj
errorSum <- sum(wErrorTerm, na.rm = TRUE)
b.err <- sqrt(1 / errorSum)
a.err <- sqrt((1 / sumWi) + (Xadjbar^2) * (b.err^2))
wYorkFitErrors <- c(a.err, b.err)
# GOODNESS OF FIT CALCULATION #
lgth <- length(X)
wSint <- Wi * (Y - b * X - a)^2
sumSint <- sum(wSint, na.rm = TRUE)
wYorkGOF <- c(sumSint / (lgth - 2), sqrt(2 / (lgth - 2))) # GOF (should equal 1 if assumptions are valid), #standard error in GOF
ans <- tibble(
Intercept = a, Slope = b,
Int_error = a.err, Slope_error = b.err,
OLS_slope = b0
)
return(ans)
}
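# Example (not run): York fit with constant measurement uncertainties
# (the data frame and column names are purely illustrative)
# df <- data.frame(X = 1:10, Y = 2 * (1:10) + rnorm(10), Xstd = 0.1, Ystd = 0.5)
# YorkFit(df, X = "X", Y = "Y", Xstd = "Xstd", Ystd = "Ystd")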
# bi-linear interpolation on a regular grid
# allows surface predictions at a low resolution to be interpolated, rather than using a GAM
interp.surface <- function(obj, loc) {
# obj is a surface or image object like the list for contour, persp or image.
# loc a matrix of 2 d locations -- new points to evaluate the surface.
x <- obj$x
y <- obj$y
z <- obj$z
nx <- length(x)
ny <- length(y)
# this clever idea for finding the intermediate coordinates at the new points
# is from J-O Irisson
lx <- approx(x, 1:nx, loc[, 1])$y
ly <- approx(y, 1:ny, loc[, 2])$y
lx1 <- floor(lx)
ly1 <- floor(ly)
# x and y distances between each new point and the closest grid point in the lower left hand corner.
ex <- lx - lx1
ey <- ly - ly1
# fix up weights to handle the case when loc are equal to
# last grid point. These have been set to NA above.
ex[lx1 == nx] <- 1
ey[ly1 == ny] <- 1
lx1[lx1 == nx] <- nx - 1
ly1[ly1 == ny] <- ny - 1
# bilinear interpolation finds simple weights based on the
# the four corners of the grid box containing the new
# points.
return(z[cbind(lx1, ly1)] * (1 - ex) * (1 - ey) + z[cbind(lx1 +
1, ly1)] * ex * (1 - ey) + z[cbind(lx1, ly1 + 1)] * (1 -
ex) * ey + z[cbind(lx1 + 1, ly1 + 1)] * ex * ey)
}
# function to do bilinear interpolation given input grid and number of points required
interp_grid <- function(input.data, x = "u", y = "v", z, n = 201) {
# current number of points
int <- length(unique(input.data[[x]]))
obj <- list(
x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = int),
y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = int),
z = matrix(z, nrow = int)
)
loc <- expand.grid(
x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
)
res.interp <- interp.surface(obj, loc)
out <- expand.grid(
u = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
v = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
)
results <- data.frame(out, z = res.interp)
}