R/polarPlot.R

Defines functions interp_grid interp_surface YorkFit wrank cont_corr calc_quantile_stat calc_robust_stat calc_york_stat calc_lm_stat calc_corr_stat calculate_weighted_statistics gauss_dens polar_prepare_grid polarPlot

Documented in polarPlot

#' Function for plotting bivariate polar plots with smoothing.
#'
#' Function for plotting pollutant concentration in polar coordinates showing
#' concentration by wind speed (or another numeric variable) and direction. Mean
#' concentrations are calculated for wind speed-direction \sQuote{bins} (e.g.
#' 0-1, 1-2 m/s,...  and 0-10, 10-20 degrees etc.).  To aid interpretation,
#' `gam` smoothing is carried out using `mgcv`.
#'
#' The bivariate polar plot is a useful diagnostic tool for quickly gaining an
#' idea of potential sources. Wind speed is one of the most useful variables to
#' use to separate source types (see references). For example, ground-level
#' concentrations resulting from buoyant plumes from chimney stacks tend to peak
#' under higher wind speed conditions. Conversely, ground-level, non-buoyant
#' plumes such as from road traffic, tend to have highest concentrations under
#' low wind speed conditions. Other sources such as from aircraft engines also
#' show differing characteristics by wind speed.
#'
#' The function has been developed to allow variables other than wind speed to
#' be plotted with wind direction in polar coordinates. The key issue is that
#' the other variable plotted against wind direction should be discriminating in
#' some way. For example, temperature can help reveal high-level sources brought
#' down to ground level in unstable atmospheric conditions, or show the effect a
#' source emission dependent on temperature e.g. biogenic isoprene.
#'
#' The plots can vary considerably depending on how much smoothing is done.  The
#' approach adopted here is based on the very flexible and capable `mgcv`
#' package that uses *Generalized Additive Models*. While methods do exist to
#' find an optimum level of smoothness, they are not necessarily useful. The
#' principal aim of `polarPlot` is as a graphical analysis rather than for
#' quantitative purposes. In this respect the smoothing aims to strike a balance
#' between revealing interesting (real) features and overly noisy data. The
#' defaults used in [polarPlot()] are based on the analysis of data from many
#' different sources. More advanced users may wish to modify the code and adopt
#' other smoothing approaches.
#'
#' Various statistics are possible to consider e.g. mean, maximum, median.
#' `statistic = "max"` is often useful for revealing sources. Pair-wise
#' statistics between two pollutants can also be calculated.
#'
#' The function can also be used to compare two pollutant species through a
#' range of pair-wise statistics (see help on `statistic`) and Grange et al.
#' (2016) (open-access publication link below).
#'
#' Wind direction is split up into 10 degree intervals and the other variable
#' (e.g. wind speed) 30 intervals. These 2D bins are then used to calculate the
#' statistics.
#'
#' These plots often show interesting features at higher wind speeds (see
#' references below). For these conditions there can be very few measurements
#' and therefore greater uncertainty in the calculation of the surface. There
#' are several ways in which this issue can be tackled. First, it is possible to
#' avoid smoothing altogether and use [polarFreq()]. Second, the effect of
#' setting a minimum number of measurements in each wind speed-direction bin can
#' be examined through `min.bin`. It is possible that a single point at high
#' wind speed conditions can strongly affect the surface prediction. Therefore,
#' setting `min.bin = 3`, for example, will remove all wind speed-direction bins
#' with fewer than 3 measurements *before* fitting the surface. Third, consider
#' setting `uncertainty = TRUE`. This option will show the predicted surface
#' together with upper and lower 95% confidence intervals, which take account of
#' the frequency of measurements.
#'
#' @inheritParams shared_openair_params
#'
#' @param mydata A data frame minimally containing `wd`, another variable to
#'   plot in polar coordinates (the default is a column \dQuote{ws} --- wind
#'   speed) and a pollutant. Should also contain `date` if plots by time period
#'   are required.
#'
#' @param pollutant Mandatory. A pollutant name corresponding to a variable in a
#'   data frame should be supplied e.g. `pollutant = "nox"`. There can also be
#'   more than one pollutant specified e.g. `pollutant = c("nox", "no2")`. The
#'   main use of using two or more pollutants is for model evaluation where two
#'   species would be expected to have similar concentrations. This saves the
#'   user stacking the data and it is possible to work with columns of data
#'   directly. A typical use would be `pollutant = c("obs", "mod")` to compare
#'   two columns \dQuote{obs} (the observations) and \dQuote{mod} (modelled
#'   values). When pair-wise statistics such as Pearson correlation and
#'   regression techniques are to be plotted, `pollutant` takes two elements
#'   too. For example, `pollutant = c("bc", "pm25")` where `"bc"` is a function
#'   of `"pm25"`.
#'
#' @param x Name of variable to plot against wind direction in polar
#'   coordinates, the default is wind speed, \dQuote{ws}.
#'
#' @param wd Name of wind direction field.
#'
#' @param statistic The statistic that should be applied to each wind
#'   speed/direction bin. Because of the smoothing involved, the colour scale
#'   for some of these statistics is only to provide an indication of overall
#'   pattern and should not be interpreted in concentration units e.g. for
#'   `statistic = "weighted.mean"` where the bin mean is multiplied by the bin
#'   frequency and divided by the total frequency. In many cases using
#'   `polarFreq` will be better. Setting `statistic = "weighted.mean"` can be
#'   useful because it provides an indication of the concentration * frequency
#'   of occurrence and will highlight the wind speed/direction conditions that
#'   dominate the overall mean.Can be:
#'
#'   \itemize{ \item  \dQuote{mean} (default), \dQuote{median}, \dQuote{max}
#'   (maximum), \dQuote{frequency}. \dQuote{stdev} (standard deviation),
#'   \dQuote{weighted.mean}.
#'
#'   \item `statistic = "nwr"` Implements the Non-parametric Wind
#'   Regression approach of Henry et al. (2009) that uses kernel smoothers. The
#'   `openair` implementation is not identical because Gaussian kernels are
#'   used for both wind direction and speed. The smoothing is controlled by
#'   `ws_spread` and `wd_spread`.
#'
#'   \item `statistic = "cpf"` the conditional probability function (CPF)
#'   is plotted and a single (usually high) percentile level is supplied. The
#'   CPF is defined as CPF = my/ny, where my is the number of samples in the y
#'   bin (by default a wind direction, wind speed interval) with mixing ratios
#'   greater than the *overall* percentile concentration, and ny is the
#'   total number of samples in the same wind sector (see Ashbaugh et al.,
#'   1985). Note that percentile intervals can also be considered; see
#'   `percentile` for details.
#'
#'   \item When `statistic = "r"` or `statistic = "Pearson"`, the
#'   Pearson correlation coefficient is calculated for *two* pollutants.
#'   The calculation involves a weighted Pearson correlation coefficient, which
#'   is weighted by Gaussian kernels for wind direction an the radial variable
#'   (by default wind speed). More weight is assigned to values close to a wind
#'   speed-direction interval. Kernel weighting is used to ensure that all data
#'   are used rather than relying on the potentially small number of values in a
#'   wind speed-direction interval.
#'
#'   \item When `statistic = "Spearman"`, the Spearman correlation
#'   coefficient is calculated for *two* pollutants. The calculation
#'   involves a weighted Spearman correlation coefficient, which is weighted by
#'   Gaussian kernels for wind direction an the radial variable (by default wind
#'   speed). More weight is assigned to values close to a wind speed-direction
#'   interval. Kernel weighting is used to ensure that all data are used rather
#'   than relying on the potentially small number of values in a wind
#'   speed-direction interval.
#'
#'   \item `"robust_slope"` is another option for pair-wise statistics and
#'   `"quantile.slope"`, which uses quantile regression to estimate the
#'   slope for a particular quantile level (see also `tau` for setting the
#'   quantile level).
#'
#'   \item `"york_slope"` is another option for pair-wise statistics which
#'   uses the *York regression method* to estimate the slope. In this
#'   method the uncertainties in `x` and `y` are used in the
#'   determination of the slope. The uncertainties are provided by
#'   `x_error` and `y_error` --- see below.}
#'
#' @param limits The function does its best to choose sensible limits
#'   automatically. However, there are circumstances when the user will wish to
#'   set different ones. An example would be a series of plots showing each year
#'   of data separately. The limits are set in the form `c(lower, upper)`, so
#'   `limits = c(0, 100)` would force the plot limits to span 0-100.
#'
#' @param exclude.missing Setting this option to `TRUE` (the default) removes
#'   points from the plot that are too far from the original data. The smoothing
#'   routines will produce predictions at points where no data exist i.e. they
#'   predict. By removing the points too far from the original data produces a
#'   plot where it is clear where the original data lie. If set to `FALSE`
#'   missing data will be interpolated.
#'
#' @param uncertainty Should the uncertainty in the calculated surface be shown?
#'   If `TRUE` three plots are produced on the same scale showing the predicted
#'   surface together with the estimated lower and upper uncertainties at the
#'   95% confidence interval. Calculating the uncertainties is useful to
#'   understand whether features are real or not.  For example, at high wind
#'   speeds where there are few data there is greater uncertainty over the
#'   predicted values. The uncertainties are calculated using the GAM and
#'   weighting is done by the frequency of measurements in each wind
#'   speed-direction bin. Note that if uncertainties are calculated then the
#'   type is set to "default".
#'
#' @param percentile If `statistic = "percentile"` then `percentile` is used,
#'   expressed from 0 to 100. Note that the percentile value is calculated in
#'   the wind speed, wind direction \sQuote{bins}. For this reason it can also
#'   be useful to set `min.bin` to ensure there are a sufficient number of
#'   points available to estimate a percentile. See `quantile` for more details
#'   of how percentiles are calculated.
#'
#'   `percentile` is also used for the Conditional Probability Function (CPF)
#'   plots. `percentile` can be of length two, in which case the percentile
#'   *interval* is considered for use with CPF. For example, `percentile = c(90,
#'   100)` will plot the CPF for concentrations between the 90 and 100th
#'   percentiles. Percentile intervals can be useful for identifying specific
#'   sources. In addition, `percentile` can also be of length 3. The third value
#'   is the \sQuote{trim} value to be applied. When calculating percentile
#'   intervals many can cover very low values where there is no useful
#'   information. The trim value ensures that values greater than or equal to
#'   the trim * mean value are considered *before* the percentile intervals are
#'   calculated. The effect is to extract more detail from many source
#'   signatures. See the manual for examples. Finally, if the trim value is less
#'   than zero the percentile range is interpreted as absolute concentration
#'   values and subsetting is carried out directly.
#'
#' @param weights At the edges of the plot there may only be a few data points
#'   in each wind speed-direction interval, which could in some situations
#'   distort the plot if the concentrations are high. `weights` applies a
#'   weighting to reduce their influence. For example and by default if only a
#'   single data point exists then the weighting factor is 0.25 and for two
#'   points 0.5. To not apply any weighting and use the data as is, use `weights
#'   = c(1, 1, 1)`.
#'
#'   An alternative to down-weighting these points they can be removed
#'   altogether using `min.bin`.
#'
#' @param min.bin The minimum number of points allowed in a wind speed/wind
#'   direction bin.  The default is 1. A value of two requires at least 2 valid
#'   records in each bin an so on; bins with less than 2 valid records are set
#'   to NA. Care should be taken when using a value > 1 because of the risk of
#'   removing real data points. It is recommended to consider your data with
#'   care. Also, the `polarFreq` function can be of use in such circumstances.
#'
#' @param mis.col When `min.bin` is > 1 it can be useful to show where data are
#'   removed on the plots. This is done by shading the missing data in
#'   `mis.col`. To not highlight missing data when `min.bin` > 1 choose `mis.col
#'   = "transparent"`.
#'
#' @param upper This sets the upper limit wind speed to be used. Often there are
#'   only a relatively few data points at very high wind speeds and plotting all
#'   of them can reduce the useful information in the plot.
#'
#' @param units The units shown on the polar axis scale.
#'
#' @param force.positive The default is `TRUE`. Sometimes if smoothing data with
#'   steep gradients it is possible for predicted values to be negative.
#'   `force.positive = TRUE` ensures that predictions remain positive. This is
#'   useful for several reasons. First, with lots of missing data more
#'   interpolation is needed and this can result in artefacts because the
#'   predictions are too far from the original data. Second, if it is known
#'   beforehand that the data are all positive, then this option carries that
#'   assumption through to the prediction. The only likely time where setting
#'   `force.positive = FALSE` would be if background concentrations were first
#'   subtracted resulting in data that is legitimately negative. For the vast
#'   majority of situations it is expected that the user will not need to alter
#'   the default option.
#'
#' @param k This is the smoothing parameter used by the `gam` function in
#'   package `mgcv`. Typically, value of around 100 (the default) seems to be
#'   suitable and will resolve important features in the plot. The most
#'   appropriate choice of `k` is problem-dependent; but extensive testing of
#'   polar plots for many different problems suggests a value of `k` of about
#'   100 is suitable. Setting `k` to higher values will not tend to affect the
#'   surface predictions by much but will add to the computation time. Lower
#'   values of `k` will increase smoothing. Sometimes with few data to plot
#'   `polarPlot` will fail. Under these circumstances it can be worth lowering
#'   the value of `k`.
#'
#' @param normalise If `TRUE` concentrations are normalised by dividing by their
#'   mean value. This is done *after* fitting the smooth surface. This option is
#'   particularly useful if one is interested in the patterns of concentrations
#'   for several pollutants on different scales e.g. NOx and CO. Often useful if
#'   more than one `pollutant` is chosen.
#'
#' @param ws_spread The value of sigma used for Gaussian kernel weighting of
#'   wind speed when `statistic = "nwr"` or when correlation and regression
#'   statistics are used such as *r*. Default is `0.5`.
#'
#' @param wd_spread The value of sigma used for Gaussian kernel weighting of
#'   wind direction when `statistic = "nwr"` or when correlation and regression
#'   statistics are used such as *r*. Default is `4`.
#'
#' @param x_error The `x` error / uncertainty used when `statistic =
#'   "york_slope"`.
#'
#' @param y_error The `y` error / uncertainty used when `statistic =
#'   "york_slope"`.
#'
#' @param kernel Type of kernel used for the weighting procedure for when
#'   correlation or regression techniques are used. Only `"gaussian"` is
#'   supported but this may be enhanced in the future.
#'
#' @param formula.label When pair-wise statistics such as regression slopes are
#'   calculated and plotted, should a formula label be displayed?
#'   `formula.label` will also determine whether concentration information is
#'   printed when `statistic = "cpf"`.
#'
#' @param tau The quantile to be estimated when `statistic` is set to
#'   `"quantile.slope"`. Default is `0.5` which is equal to the median and will
#'   be ignored if `"quantile.slope"` is not used.
#'
#' @param ... Addition options are passed on to [cutData()] for `type` handling.
#'   Some additional arguments are also available:
#'   - `xlab`, `ylab` and `main` override the x-axis label, y-axis label, and plot title.
#'   - `layout` sets the layout of facets - e.g., `layout(2, 5)` will have 2 columns and 5 rows.
#'   - `fontsize` overrides the overall font size of the plot.
#'   - `annotate = FALSE` will not plot the N/E/S/W labels.
#'
#' @return an [openair][openair-package] object. `data` contains four set
#'   columns: `cond`, conditioning based on `type`; `u` and `v`, the
#'   translational vectors based on `ws` and `wd`; and the local `pollutant`
#'   estimate.
#' @author David Carslaw
#' @family polar directional analysis functions
#' @references
#'
#' Ashbaugh, L.L., Malm, W.C., Sadeh, W.Z., 1985. A residence time probability
#' analysis of sulfur concentrations at ground canyon national park. Atmospheric
#' Environment 19 (8), 1263-1270.
#'
#' Carslaw, D.C., Beevers, S.D, Ropkins, K and M.C. Bell (2006). Detecting and
#' quantifying aircraft and other on-airport contributions to ambient nitrogen
#' oxides in the vicinity of a large international airport.  Atmospheric
#' Environment. 40/28 pp 5424-5434.
#'
#' Carslaw, D.C., & Beevers, S.D. (2013). Characterising and understanding
#' emission sources using bivariate polar plots and k-means clustering.
#' Environmental Modelling & Software, 40, 325-329. DOI:
#' 10.1016/j.envsoft.2012.09.005.
#'
#' Henry, R.C., Chang, Y.S., Spiegelman, C.H., 2002. Locating nearby sources of
#' air pollution by nonparametric regression of atmospheric concentrations on
#' wind direction. Atmospheric Environment 36 (13), 2237-2244.
#'
#' Henry, R., Norris, G.A., Vedantham, R., Turner, J.R., 2009. Source region
#' identification using Kernel smoothing. Environ. Sci. Technol. 43 (11),
#' 4090e4097. DOI: 10.1021/es8011723.
#'
#' Uria-Tellaetxe, I. and D.C. Carslaw (2014). Source identification using a
#' conditional bivariate Probability function. Environmental Modelling &
#' Software, Vol. 59, 1-9.
#'
#' Westmoreland, E.J., N. Carslaw, D.C. Carslaw, A. Gillah and E. Bates (2007).
#' Analysis of air quality within a street canyon using statistical and
#' dispersion modelling techniques. Atmospheric Environment. Vol.  41(39), pp.
#' 9195-9205.
#'
#' Yu, K.N., Cheung, Y.P., Cheung, T., Henry, R.C., 2004. Identifying the impact
#' of large urban airports on local air quality by nonparametric regression.
#' Atmospheric Environment 38 (27), 4501-4507.
#'
#' Grange, S. K., Carslaw, D. C., & Lewis, A. C. 2016. Source apportionment
#' advances with bivariate polar plots, correlation, and regression techniques.
#' Atmospheric Environment. 145, 128-134. DOI: 10.1016/j.atmosenv.2016.09.016.
#' @examples
#'
#' # basic plot
#' polarPlot(mydata, pollutant = "nox")
#' \dontrun{
#'
#' # polarPlots by year on same scale
#' polarPlot(mydata, pollutant = "so2", type = "year", main = "polarPlot of so2")
#'
#' # set minimum number of bins to be used to see if pattern remains similar
#' polarPlot(mydata, pollutant = "nox", min.bin = 3)
#'
#' # plot by day of the week
#' polarPlot(mydata, pollutant = "pm10", type = "weekday")
#'
#' # show the 95% confidence intervals in the surface fitting
#' polarPlot(mydata, pollutant = "so2", uncertainty = TRUE)
#'
#'
#' # Pair-wise statistics
#' # Pearson correlation
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "r")
#'
#' # Robust regression slope, takes a bit of time
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "robust.slope")
#'
#' # Least squares regression works too but it is not recommended, use robust
#' # regression
#' # polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "slope")
#' }
#'
#' @export
polarPlot <-
  function(
    mydata,
    pollutant = "nox",
    x = "ws",
    wd = "wd",
    type = "default",
    statistic = "mean",
    limits = NULL,
    exclude.missing = TRUE,
    uncertainty = FALSE,
    percentile = NA,
    cols = "default",
    weights = c(0.25, 0.5, 0.75),
    min.bin = 1,
    mis.col = "grey",
    upper = NA,
    angle.scale = 315,
    units = x,
    force.positive = TRUE,
    k = 100,
    normalise = FALSE,
    key.title = paste(statistic, pollutant, sep = " "),
    key.position = "right",
    strip.position = "top",
    auto.text = TRUE,
    ws_spread = 1.5,
    wd_spread = 5,
    x_error = NA,
    y_error = NA,
    kernel = "gaussian",
    formula.label = TRUE,
    tau = 0.5,
    plot = TRUE,
    key = NULL,
    ...
  ) {
    if (
      statistic == "percentile" && is.na(percentile[1] && statistic != "cpf")
    ) {
      warning("percentile value missing, using 50")
      percentile <- 50
    }

    if (statistic == "cpf" && is.na(percentile[1])) {
      warning("percentile value missing, using 75")
      percentile <- 75
    }

    # Allow for period notation
    statistic <- gsub("\\.| ", "_", statistic)

    # Build vector for many checks
    correlation_stats <- c(
      "r",
      "slope",
      "intercept",
      "robust_slope",
      "robust_intercept",
      "quantile_slope",
      "quantile_intercept",
      "Pearson",
      "Spearman",
      "york_slope"
    )

    if (statistic %in% correlation_stats && length(pollutant) != 2) {
      stop("Correlation statistic requires two pollutants.")
    }

    if (length(type) > 2) {
      stop("Maximum number of types is 2.")
    }

    ## can't have conditioning here
    if (uncertainty) {
      type <- "default"
    }

    if (uncertainty && length(pollutant) > 1) {
      stop("Can only have one pollutant when uncertainty = TRUE")
    }

    if (
      !statistic %in%
        c(
          "mean",
          "median",
          "frequency",
          "max",
          "stdev",
          "weighted_mean",
          "percentile",
          "cpf",
          "nwr",
          correlation_stats
        )
    ) {
      stop("statistic '", statistic, "' not recognised.")
    }

    if (length(weights) != 3) {
      stop("weights should be of length 3.")
    }

    # check key.position
    key.position <- check_key_position(key.position, key)

    ## extra.args setup
    extra.args <- list(...)

    ## label controls
    extra.args$xlab <- if ("xlab" %in% names(extra.args)) {
      quickText(extra.args$xlab, auto.text)
    } else {
      quickText("", auto.text)
    }

    extra.args$ylab <- if ("ylab" %in% names(extra.args)) {
      quickText(extra.args$ylab, auto.text)
    } else {
      quickText("", auto.text)
    }

    extra.args$main <- if ("main" %in% names(extra.args)) {
      quickText(extra.args$main, auto.text)
    } else {
      quickText("", auto.text)
    }

    # if clustering, return lower resolution plot
    extra.args$cluster <- if ("cluster" %in% names(extra.args)) {
      TRUE
    } else {
      FALSE
    }

    ## need to initialise key_header & key_footer for later combination
    key_header <- statistic
    key_footer <- paste(pollutant, collapse = ", ")

    ## layout default
    if (!"layout" %in% names(extra.args)) {
      extra.args$layout <- NULL
    }

    ## extract variables of interest
    vars <- c(wd, x, pollutant)

    if (statistic == "york_slope") {
      vars <- c(vars, x_error, y_error)
    }

    if (any(type %in% dateTypes)) {
      vars <- c(vars, "date")
    }

    mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)

    mydata <- stats::na.omit(mydata)

    # check to see if a high proportion of the data is negative
    if (
      length(which(mydata[pollutant] < 0)) / nrow(mydata) > 0.1 &&
        force.positive
    ) {
      warning(">10% negative data detected, set force.positive = FALSE?")
    }

    # scale data by subtracting the min value this helps with dealing
    # with negative data on radial axis (starts from zero, always
    # positive)

    # return radial scale as attribute "radial_scale" on returned data for external plotting
    radial_scale <- range(mydata[[x]])

    mydata[[x]] <- mydata[[x]] - min(mydata[[x]], na.rm = TRUE)

    # if more than one pollutant, need to stack the data and set type =
    # "variable" this case is most relevent for model-measurement
    # compasrions where data are in columns Can also do more than one
    # pollutant and a single type that is not "default", in which case
    # pollutant becomes a conditioning variable

    # if statistic is 'r' we need the data for two pollutants in columns
    if (length(pollutant) > 1 && !statistic %in% correlation_stats) {
      if (length(type) > 1) {
        cli::cli_warn("Only {.arg type} = '{type[1]}' will be used.")
        type <- type[1]
      }

      ## use pollutants as conditioning variables
      mydata <- tidyr::gather(
        mydata,
        key = "variable",
        value = "value",
        dplyr::all_of(pollutant),
        factor_key = TRUE
      )
      ## now set pollutant to "value"
      pollutant <- "value"

      if (type == "default") {
        type <- "variable"
      } else {
        type <- c(type, "variable")
      }
    }

    ## cutData depending on type
    mydata <- cutData(mydata, type, ...)

    ## if upper ws not set, set it to the max to display all information
    max.ws <- max(mydata[[x]], na.rm = TRUE)
    min.ws <- min(mydata[[x]], na.rm = TRUE)
    clip <- TRUE ## used for removing data where ws > upper

    if (is.na(upper)) {
      upper <- max.ws
      clip <- FALSE
    }

    # resolution deprecated, int is resolution of GAM surface predictions over int * int grid
    # 51 works well with bilinear interpolation of results

    int <- 51

    ## binning wd data properly
    ## use 10 degree binning of wd if already binned, else 5
    if (all(mydata[[wd]] %% 10 == 0, na.rm = TRUE)) {
      wd.int <- 10
    } else {
      if (toupper(statistic) == "NWR") wd.int <- 2 else wd.int <- 5 ## how to split wd
    }

    if (toupper(statistic) == "NWR") {
      ws_bins <- 40
    } else {
      ws_bins <- 30
    }

    if (statistic == "nwr") {
      k <- 200
    } # limit any smoothing

    ws.seq <- seq(min.ws, max.ws, length = ws_bins)
    wd.seq <- seq(from = wd.int, to = 360, by = wd.int) ## wind directions from wd.int to 360
    ws.wd <- expand.grid(x = ws.seq, wd = wd.seq)

    u <- with(ws.wd, x * sin(pi * wd / 180)) ## convert to polar coords
    v <- with(ws.wd, x * cos(pi * wd / 180))

    ## data to predict over
    input.data <- expand.grid(
      u = seq(-upper, upper, length = int),
      v = seq(-upper, upper, length = int)
    )

    ## Pval is only set in the CPF branch; initialise to NULL so it always exists
    Pval <- NULL

    if (statistic == "cpf") {
      ## can be interval of percentiles or a single (threshold)
      if (length(percentile) > 1) {
        statistic <- "cpfi" # CPF interval

        if (length(percentile) == 3) {
          ## in this case there is a trim value as a proprtion of the mean
          ## if this value <0 use absolute values as range
          Mean <- mean(mydata[[pollutant]], na.rm = TRUE)

          if (percentile[3] < 0) {
            Pval <- percentile[1:2]
          } else {
            Pval <- stats::quantile(
              subset(
                mydata[[pollutant]],
                mydata[[pollutant]] >=
                  Mean *
                    percentile[3]
              ),
              probs = percentile[1:2] / 100,
              na.rm = TRUE
            )
          }
        } else {
          Pval <- stats::quantile(
            mydata[[pollutant]],
            probs = percentile / 100,
            na.rm = TRUE
          )
        }

        sub <- paste0(
          "CPF (",
          format(Pval[1], digits = 2),
          " to ",
          format(Pval[2], digits = 2),
          ")"
        )
      } else {
        Pval <- stats::quantile(
          mydata[[pollutant]],
          probs = percentile / 100,
          na.rm = TRUE
        )

        sub <- paste0(
          "CPF at the ",
          percentile,
          "th percentile (=",
          format(Pval, digits = 2),
          ")"
        )
        if (!formula.label) sub <- NULL # no label
      }
    } else {
      sub <- NULL
    }

    ## Thin wrapper — delegates to polar_prepare_grid() with all dependencies
    ## passed explicitly rather than captured from the enclosing scope.
    prepare.grid <- function(mydata) {
      polar_prepare_grid(
        mydata = mydata,
        wd_col = wd,
        x_col = x,
        pollutant = pollutant,
        statistic = statistic,
        correlation_stats = correlation_stats,
        Pval = Pval,
        percentile = percentile,
        ws.wd = ws.wd,
        u = u,
        v = v,
        input.data = input.data,
        ws_bins = ws_bins,
        wd.int = wd.int,
        max.ws = max.ws,
        weights = weights,
        min.bin = min.bin,
        force.positive = force.positive,
        uncertainty = uncertainty,
        exclude.missing = exclude.missing,
        upper = upper,
        k = k,
        ws_spread = ws_spread,
        wd_spread = wd_spread,
        kernel = kernel,
        tau = tau,
        x_error = x_error,
        y_error = y_error,
        cluster = extra.args$cluster
      )
    }

    ## if min.bin >1 show the missing data. Work this out by running twice:
    ## first time with no missings, second with min.bin.
    ## Plot one surface on top of the other.

    if (!missing(min.bin)) {
      tmp <- min.bin
      min.bin <- 0
      res1 <-
        map_type(
          mydata,
          type,
          prepare.grid,
          .include_default = TRUE
        )

      min.bin <- tmp

      res <-
        map_type(
          mydata,
          type,
          prepare.grid,
          .include_default = TRUE
        )

      res$miss <- res1$z
    } else {
      res <- map_type(
        mydata,
        type,
        prepare.grid,
        .include_default = TRUE
      )
    }

    ## with CPF make sure not >1 due to surface fitting
    if (any(res$z > 1, na.rm = TRUE) && statistic %in% c("cpf", "cpfi")) {
      id <- which(res$z > 1)
      res$z[id] <- 1
    }

    ## remove wind speeds > upper to make a circle
    if (clip) {
      res$z[(res$u^2 + res$v^2)^0.5 > upper] <- NA
    }

    ## normalise by divining by mean conditioning value if needed
    if (normalise) {
      res <- dplyr::mutate(
        res,
        z = .data$z / mean(.data$z, na.rm = TRUE),
        .by = dplyr::all_of(type)
      )

      key_footer <- "normalised level"
    }

    # correlation notation
    if (statistic %in% c("r", "Pearson", "Spearman")) {
      key_footer <- paste0("corr(", pollutant[1], ", ", pollutant[2], ")")

      # make sure smoothing does not results in r>1 or <-1
      # sometimes happens with little data at edges
      id <- which(res$z > 1)
      if (length(id) > 0) {
        res$z[id] <- 1
      }

      id <- which(res$z < -1)
      if (length(id) > 0) res$z[id] <- -1
    }

    # if regression, set key_footer to 'm' (slope)
    if (grepl("slope|intercept", statistic) && length(pollutant) == 2) {
      key_footer <- "m"
    }

    # Labels for correlation and regression, keep lower case like other labels
    if (missing(key.title)) {
      if (key_header == "nwr") {
        key_header <- "NWR"
      }
      if (key_header == "weighted_mean") {
        key_header <- "weighted mean"
      }
      if (key_header == "percentile") {
        key_header <- c(paste0(percentile, "th"), "percentile")
      }

      if ("cpf" %in% key_header) {
        key_header <- c("CPF probability")
      }

      if (statistic %in% c("r", "Pearson")) {
        key_header <- ("Pearson correlation")
      }

      if (statistic == "Spearman") {
        key_header <- "Spearman correlation"
      }

      if (statistic == "robust_slope") {
        key_header <- "robust slope"
      }

      if (statistic == "robust_intercept") {
        key_header <- "robust intercept"
      }

      if (statistic == "york_slope") {
        key_header <- "York regression slope"
      }

      if (statistic == "quantile_slope") {
        key_header <- paste0("quantile slope (tau: ", tau, ")")
      }

      if (statistic == "quantile_intercept") {
        key_header <- paste0("quantile intercept (tau: ", tau, ")")
      }
    }

    ## special handling of layout for uncertainty
    if (uncertainty && is.null(extra.args$layout)) {
      extra.args$layout <- c(3, 1)
    }

    # if uncertainty = TRUE, change type for plotting (3 panels)
    if (uncertainty) {
      type <- "uncertainty"
    }

    # Format legend title
    if (missing(key.title)) {
      key.title <- paste(
        key_header,
        key_footer,
        sep = ifelse(key.position %in% c("top", "bottom"), " ", "\n")
      )
    }

    # check if key.header / key.footer are being used
    key.title <- check_key_header(key.title, extra.args)

    # run through
    legend_title <- quickText(key.title, auto.text)

    # restore to polar coordinates
    plot_data <-
      res |>
      dplyr::mutate(
        x = sqrt(.data$u^2 + .data$v^2),
        wd = 270 - atan2(.data$v, .data$u) * (180 / pi),
        wd = (wd + 180) %% 360
      )

    # if using the min.bin arg, just drop missing z that's also missing "miss".
    # otherwise drop all of missing z
    if ("miss" %in% names(plot_data)) {
      plot_data <- dplyr::filter(
        plot_data,
        !(is.na(.data$z) & is.na(.data$miss))
      ) |>
        dplyr::select(!"miss")
    } else {
      plot_data <- tidyr::drop_na(plot_data, "z")
    }

    # restore scale pre-normalisation
    plot_data <-
      dplyr::mutate(
        plot_data,
        x = scales::rescale(
          .data$x,
          to = radial_scale,
          from = c(0, diff(radial_scale))
        )
      )

    # Normalise limits: must be NULL or a numeric vector of length 2
    if (!is.null(limits) && !(is.numeric(limits) && length(limits) == 2)) {
      limits <- NULL
    }

    thePlot <-
      plot_data |>
      dplyr::arrange(!is.na(.data$z), .data$z) |>
      ggplot2::ggplot(ggplot2::aes(x = .data$wd, y = .data$x)) +
      ggplot2::geom_point(ggplot2::aes(colour = .data$z)) +
      ggplot2::ggproto(
        NULL,
        ggplot2::coord_radial(r.axis.inside = angle.scale),
        inner_radius = c(0, 1) * 0.475
      ) +
      scale_x_compass() +
      ggplot2::scale_y_continuous(
        limits = range(pretty(plot_data$x, 20)),
        expand = ggplot2::expansion()
      ) +
      ggplot2::scale_color_gradientn(
        colours = openColours(cols),
        oob = scales::oob_squish,
        na.value = mis.col,
        limits = limits
      ) +
      theme_openair_radial(key.position, panel.ontop = TRUE) +
      set_extra_fontsize(extra.args) +
      annotate_compass_points(
        size = ifelse(
          extra.args$annotate %||% TRUE,
          if (is.null(extra.args$fontsize)) 3 else extra.args$fontsize / 3,
          0
        )
      ) +
      ggplot2::labs(
        color = legend_title,
        x = extra.args$xlab,
        y = extra.args$ylab,
        title = extra.args$main,
        subtitle = sub,
        caption = if (
          formula.label &&
            grepl("slope|intercept", statistic) &&
            length(pollutant) == 2
        ) {
          quickText(
            paste0("Formula: ", pollutant[1], " = m.", pollutant[2], " + c"),
            auto.text
          )
        } else {
          NULL
        }
      ) +
      {
        if (key.position %in% c("left", "right")) {
          ggplot2::theme(
            legend.key.height = ggplot2::unit(1, "null"),
            legend.key.spacing.y = ggplot2::unit(0, "cm")
          )
        }
      } +
      {
        if (key.position %in% c("top", "bottom")) {
          ggplot2::theme(
            legend.key.width = ggplot2::unit(1, "null"),
            legend.key.spacing.x = ggplot2::unit(0, "cm")
          )
        }
      } +
      get_facet(
        type,
        extra.args = extra.args,
        scales = "fixed",
        auto.text = auto.text,
        strip.position = strip.position
      ) +
      ggplot2::guides(
        color = ggplot2::guide_colorbar(
          theme = ggplot2::theme(
            legend.title.position = ifelse(
              key.position %in% c("left", "right"),
              "top",
              key.position
            ),
            legend.text.position = key.position
          )
        )
      )

    if (plot) {
      plot(thePlot)
    }

    newdata <- plot_data

    # return attribute of scale used - useful for data with negative scales such as air_temp
    attr(newdata, "radial_scale") <- range(radial_scale)
    output <- list(plot = thePlot, data = newdata, call = match.call())
    class(output) <- "openair"

    # Final return
    invisible(output)
  }


## Top-level surface-fitting worker for polarPlot().
##
## In the original code this was a nested closure (prepare.grid) that implicitly
## captured ~25 variables from the enclosing polarPlot() scope.  Lifting it to a
## top-level function makes every dependency explicit, improves readability, and
## allows independent testing.
##
## Parameters mirror the variables previously captured from polarPlot():
##   wd_col / x_col  — column names for wind direction and the radial variable
##                     (renamed to avoid shadowing the local cut() results)
polar_prepare_grid <- function(
  mydata,
  wd_col,
  x_col,
  pollutant,
  statistic,
  correlation_stats,
  Pval = NULL,
  percentile = NA,
  ws.wd,
  u,
  v,
  input.data,
  ws_bins,
  wd.int,
  max.ws,
  weights,
  min.bin,
  force.positive,
  uncertainty,
  exclude.missing,
  upper,
  k,
  ws_spread,
  wd_spread,
  kernel,
  tau,
  x_error,
  y_error,
  cluster = FALSE
) {
  ## Identify which ws and wd bins the data belong to.
  ## Note: local variables 'wd' and 'x' shadow the column-name parameters.
  wd <- cut(
    wd.int * ceiling(mydata[[wd_col]] / wd.int - 0.5),
    breaks = seq(0, 360, wd.int),
    include.lowest = TRUE
  )

  x <- cut(
    mydata[[x_col]],
    breaks = seq(0, max.ws, length = ws_bins + 1),
    include.lowest = TRUE
  )

  if (!statistic %in% c(correlation_stats, "nwr")) {
    ## Simple binned statistics
    stat_fn <- switch(
      statistic,
      frequency = \(x) length(stats::na.omit(x)),
      mean = \(x) mean(x, na.rm = TRUE),
      median = \(x) stats::median(x, na.rm = TRUE),
      max = \(x) max(x, na.rm = TRUE),
      stdev = \(x) stats::sd(x, na.rm = TRUE),
      cpf = \(x) length(which(x > Pval)) / length(x),
      cpfi = \(x) length(which(x > Pval[1] & x <= Pval[2])) / length(x),
      weighted_mean = \(x) mean(x) * length(x) / nrow(mydata),
      percentile = \(x) {
        stats::quantile(x, probs = percentile / 100, na.rm = TRUE)
      }
    )

    binned <- mydata |>
      dplyr::mutate(
        wd_bin = cut(
          wd.int * ceiling(.data[[wd_col]] / wd.int - 0.5),
          breaks = seq(0, 360, wd.int),
          include.lowest = TRUE
        ),
        x_bin = cut(
          .data[[x_col]],
          breaks = seq(0, max.ws, length = ws_bins + 1),
          include.lowest = TRUE
        )
      ) |>
      dplyr::summarise(
        value = stat_fn(.data[[pollutant]]),
        .by = c("wd_bin", "x_bin")
      ) |>
      # expand.grid(x = ws.seq, wd = wd.seq) has x_bin varying fastest, wd_bin slowest
      tidyr::complete(.data$wd_bin, .data$x_bin) |>
      dplyr::arrange(.data$wd_bin, .data$x_bin) |>
      dplyr::pull(.data$value) |>
      unname()
  } else if (toupper(statistic) == "NWR") {
    ## Non-parametric Wind Regression: fully vectorised over all grid points.
    ##
    ## For each grid point g and each observation d, compute the Gaussian kernel
    ## weight w[g,d] in one shot via outer(), then recover the weighted mean as
    ## a matrix–vector product.  Avoids the per-row dplyr rowwise() overhead.
    ws_diff <- outer(ws.wd$x, mydata[[x_col]], `-`) # n_grid × n_data
    wd_diff <- outer(ws.wd$wd, mydata[[wd_col]], `-`) # n_grid × n_data
    wd_diff[wd_diff < -180] <- wd_diff[wd_diff < -180] + 360
    W <- gauss_dens(ws_diff, wd_diff, 0, 0, ws_spread, wd_spread)
    binned <- drop(W %*% mydata[[pollutant]]) / rowSums(W)
  } else {
    ## Kernel-weighted pair-wise statistics (correlation, regression, etc.)
    binned <- mapply(
      calculate_weighted_statistics,
      ws1 = ws.wd$x,
      wd1 = ws.wd$wd,
      MoreArgs = list(
        mydata = mydata,
        statistic = statistic,
        x = x_col,
        y = wd_col,
        pol_1 = pollutant[1],
        pol_2 = pollutant[2],
        ws_spread = ws_spread,
        wd_spread = wd_spread,
        kernel = kernel,
        tau = tau,
        x_error = x_error,
        y_error = y_error
      )
    )
    binned <- ifelse(binned == Inf, NA, binned)
  }

  ## Bin frequencies — used for weighting and min.bin filtering
  bin.len <- tapply(mydata[[pollutant[1]]], list(x, wd), length)
  binned.len <- as.vector(bin.len)

  ## Apply edge weights (down-weight bins with very few observations)
  W <- rep(1, times = length(binned))
  ids <- which(binned.len == 1)
  W[ids] <- W[ids] * weights[1]
  ids <- which(binned.len == 2)
  W[ids] <- W[ids] * weights[2]
  ids <- which(binned.len == 3)
  W[ids] <- W[ids] * weights[3]

  ## Remove bins with fewer than min.bin observations
  ids <- which(binned.len < min.bin)
  binned[ids] <- NA
  binned.len[ids] <- NA

  ## GAM power transform: sqrt for positive-only data, identity otherwise
  n <- if (force.positive) 0.5 else 1

  if (!uncertainty) {
    ## Standard surface fit (no confidence intervals)
    Mgam <- try(mgcv::gam(binned^n ~ s(u, v, k = k), weights = W), TRUE)

    if (!inherits(Mgam, "try-error")) {
      pred <- as.vector(mgcv::predict.gam(Mgam, input.data))^(1 / n)

      if (cluster) {
        results <- interp_grid(input.data, z = pred, n = 101)
        int <- 101
      } else {
        results <- interp_grid(input.data, z = pred, n = 201)
        int <- 201
      }
    } else {
      results <- data.frame(u = u, v = v, z = binned)
      exclude.missing <- FALSE
      cli::cli_warn(
        c(
          "Not enough data to fit surface. Try:",
          "i" = "Reducing the value of the smoothing parameter, k to less than {k}, or",
          "i" = "Using {.arg statistic} = 'nwr'"
        )
      )
    }
  } else {
    ## Surface fit with 95% confidence intervals
    Mgam <- mgcv::gam(binned^n ~ s(u, v, k = k), weights = binned.len)
    pred_se <- mgcv::predict.gam(Mgam, input.data, se.fit = TRUE)
    uncer <- 2 * as.vector(pred_se[[2]]) ## approx 95% CI half-width

    ## Unweighted central prediction
    Mgam <- mgcv::gam(binned^n ~ s(u, v, k = k))
    pred <- as.vector(mgcv::predict.gam(Mgam, input.data))
    Lower <- (pred - uncer)^(1 / n)
    Upper <- (pred + uncer)^(1 / n)
    pred <- pred^(1 / n)

    int <- 201
    lower_uncer <- interp_grid(input.data, z = Lower, n = int) |>
      dplyr::mutate(uncertainty = "lower uncertainty")
    upper_uncer <- interp_grid(input.data, z = Upper, n = int) |>
      dplyr::mutate(uncertainty = "upper uncertainty")
    prediction <- interp_grid(input.data, z = pred, n = int) |>
      dplyr::mutate(uncertainty = "prediction")
    results <- dplyr::bind_rows(prediction, lower_uncer, upper_uncer)
  }

  ## Remove predictions that are too far from the original data
  exclude <- function(results) {
    x <- seq(-upper, upper, length = int)
    wsp <- rep(x, int)
    wdp <- rep(x, rep(int, int))

    all.data <- stats::na.omit(data.frame(u, v, binned.len))
    ind <- with(all.data, mgcv::exclude.too.far(wsp, wdp, u, v, dist = 0.05))
    results$z[ind] <- NA
    results
  }

  if (exclude.missing) {
    results <- exclude(results)
  }

  results
}


# Gaussian bivariate density function
gauss_dens <- function(x, y, mx, my, sx, sy) {
  (1 / (2 * pi * sx * sy)) *
    exp((-1 / 2) * ((x - mx)^2 / sx^2 + (y - my)^2 / sy^2))
}


## Kernel-weighted pair-wise statistics dispatcher.
##
## Takes the grid centre (ws1, wd1) directly — no more data[[1]] / across()
## indirection.  Computes Gaussian kernel weights, filters low-weight rows, then
## delegates to the appropriate sub-function.  Returns a scalar suitable for
## direct use with mapply().
calculate_weighted_statistics <- function(
  ws1,
  wd1,
  mydata,
  statistic,
  x = "ws",
  y = "wd",
  pol_1,
  pol_2,
  ws_spread,
  wd_spread,
  kernel,
  tau,
  x_error,
  y_error
) {
  weight <- NULL

  # Gaussian kernel weights centred on this ws/wd point
  ws_cent <- mydata[[x]] - ws1
  wd_cent <- mydata[[y]] - wd1
  wd_cent <- ifelse(wd_cent < -180, wd_cent + 360, wd_cent)

  weight <- gauss_dens(ws_cent, wd_cent, 0, 0, ws_spread, wd_spread)
  weight <- weight / max(weight)

  mydata$weight <- weight

  # Select columns and filter low-weight rows
  vars <- c(pol_1, pol_2, "weight")
  if (!all(is.na(c(x_error, y_error)))) {
    vars <- c(vars, x_error, y_error)
  }
  thedata <- mydata[vars]
  thedata <- thedata[which(thedata$weight > 0.001), ]

  # Insufficient data — return NA
  if (nrow(thedata) < 100) {
    return(NA)
  }

  # Dispatch to the appropriate statistical sub-function
  if (statistic %in% c("r", "Pearson", "Spearman")) {
    stat_weighted <- calc_corr_stat(thedata, pol_1, pol_2, statistic)
  } else if (statistic %in% c("slope", "intercept")) {
    stat_weighted <- calc_lm_stat(thedata, pol_1, pol_2, statistic)
  } else if (statistic == "york_slope") {
    stat_weighted <- calc_york_stat(thedata, pol_1, pol_2, x_error, y_error)
  } else if (grepl("robust", statistic, ignore.case = TRUE)) {
    stat_weighted <- calc_robust_stat(thedata, pol_1, pol_2, statistic)
  } else if (grepl("quantile", statistic, ignore.case = TRUE)) {
    stat_weighted <- calc_quantile_stat(thedata, pol_1, pol_2, statistic, tau)
  } else {
    stat_weighted <- NA
  }

  stat_weighted
}


# Weighted Pearson or Spearman correlation
calc_corr_stat <- function(thedata, pol_1, pol_2, statistic) {
  if (statistic == "r") {
    statistic <- "Pearson"
  }
  cont_corr(
    thedata[[pol_1]],
    thedata[[pol_2]],
    w = thedata$weight,
    method = statistic
  )
}


# Weighted OLS slope or intercept
calc_lm_stat <- function(thedata, pol_1, pol_2, statistic) {
  thedata <- data.frame(thedata)
  fit <- stats::lm(
    thedata[, pol_1] ~ thedata[, pol_2],
    weights = thedata[, "weight"]
  )
  if (statistic == "slope") fit$coefficients[2] else fit$coefficients[1]
}


# York regression slope (errors-in-variables)
calc_york_stat <- function(thedata, pol_1, pol_2, x_error, y_error) {
  thedata <- as.data.frame(thedata)
  result <- try(
    YorkFit(
      thedata,
      X = names(thedata)[2],
      Y = names(thedata)[1],
      Xstd = x_error,
      Ystd = y_error,
      weight = thedata$weight
    ),
    TRUE
  )
  if (!inherits(result, "try-error")) result$Slope else NA
}


# Weighted robust (M-estimator) regression slope or intercept
calc_robust_stat <- function(thedata, pol_1, pol_2, statistic) {
  thedata <- data.frame(thedata)
  fit <- suppressWarnings(
    try(
      MASS::rlm(
        thedata[, pol_1] ~ thedata[, pol_2],
        weights = thedata[, "weight"],
        method = "M"
      ),
      TRUE
    )
  )
  if (inherits(fit, "try-error")) {
    return(NA)
  }
  if (statistic == "robust_slope") fit$coefficients[2] else fit$coefficients[1]
}


# Weighted quantile regression slope or intercept
calc_quantile_stat <- function(thedata, pol_1, pol_2, statistic, tau) {
  rlang::check_installed("quantreg")
  thedata <- data.frame(thedata)
  fit <- suppressWarnings(
    try(
      quantreg::rq(
        thedata[[pol_1]] ~ thedata[[pol_2]],
        tau = tau,
        weights = thedata[["weight"]],
        method = "br"
      ),
      TRUE
    )
  )
  if (inherits(fit, "try-error")) {
    return(NA)
  }
  if (statistic == "quantile_slope") {
    fit$coefficients[2]
  } else {
    fit$coefficients[1]
  }
}


# Weighted Pearson and Spearman correlations, based on wCorr package
cont_corr <- function(x, y, w, method = c("Pearson", "Spearman")) {
  if (!is.numeric(x)) {
    x <- as.numeric(x)
  }
  if (!is.numeric(y)) {
    y <- as.numeric(y)
  }
  if (!is.numeric(w)) {
    w <- as.numeric(w)
  }
  pm <- pmatch(tolower(method[[1]]), tolower(c("Pearson", "Spearman")))
  if (pm == 2) {
    x <- wrank(x, w)
    y <- wrank(y, w)
  }
  xb <- sum(w * x) / sum(w)
  yb <- sum(w * y) / sum(w)
  numerator <- sum(w * (x - xb) * (y - yb))
  denom <- sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
  numerator / denom
}


wrank <- function(x, w = rep(1, length(x))) {
  # sort by x so we can just traverse once
  ord <- order(x)
  rord <- (seq_along(x))[order(ord)] # reverse order
  xp <- x[ord] # x, permuted
  wp <- w[ord] # weights, permuted
  rnk <- rep(NA, length(x)) # blank ranks vector
  # setup first itteration
  t1 <- 0 # total weight of lower ranked elements
  i <- 1 # index
  t2 <- 0 # total weight of tied elements (including self)
  n <- 0 # number of tied elements
  while (i < length(x)) {
    t2 <- t2 + wp[i] # tied weight increases by this unit
    n <- n + 1 # one additional tied unit
    if (xp[i + 1] != xp[i]) {
      # the next one is not a tie
      # find the rank of all tied elements
      rnki <- t1 + (n + 1) / (2 * n) * t2
      # push that rank to all tied units
      for (ii in 1:n) {
        rnk[i - ii + 1] <- rnki
      }
      # reset for next itteration
      t1 <- t1 + t2 # new total weight for lower values
      t2 <- 0 # new tied weight starts at 0
      n <- 0 # no tied units
    }
    i <- i + 1
  }
  # final row
  n <- n + 1 # add final tie
  t2 <- t2 + wp[i] # add final weight to tied weight
  rnki <- t1 + (n + 1) / (2 * n) * t2 # final rank
  # push that rank to all final tied units
  for (ii in 1:n) {
    rnk[i - ii + 1] <- rnki
  }
  # order by incoming index, so put in the original order
  rnk[rord]
}


YorkFit <- function(
  input_data,
  X = "X",
  Y = "Y",
  Xstd = "Xstd",
  Ystd = "Ystd",
  weight = NA,
  Ri = 0,
  eps = 1e-7
) {
  # weight can be supplied to apply to errors

  tol <- 1e-7 # need to refine

  # b0 initial guess at slope for OLR
  form <- stats::formula(paste(Y, "~", X))
  mod <- stats::lm(form, data = input_data)
  b0 <- mod$coefficients[2]

  X <- input_data[[X]]
  Y <- input_data[[Y]]

  Xstd <- input_data[[Xstd]]
  Ystd <- input_data[[Ystd]]

  # don't try regression if < 3 points
  if (sum(!is.na(X)) < 3 || sum(!is.na(Y)) < 3) {
    return()
  }

  # used in polar plots - Gaussian kernel weighting
  if (!all(is.na(weight))) {
    Xstd <- Xstd / weight
    Ystd <- Ystd / weight
  }

  Xw <- 1 / (Xstd^2) # X weights
  Yw <- 1 / (Ystd^2) # Y weights

  # ITERATIVE CALCULATION OF SLOPE AND INTERCEPT #

  b <- b0
  b.diff <- tol + 1

  n <- 0 # counter for debugging

  while (b.diff > tol && n < 100) {
    n <- n + 1 # counter to keep a check on convergence

    b.old <- b
    alpha.i <- sqrt(Xw * Yw)
    Wi <- (Xw * Yw) / ((b^2) * Yw + Xw - 2 * b * Ri * alpha.i)
    WiX <- Wi * X
    WiY <- Wi * Y
    sumWiX <- sum(WiX, na.rm = TRUE)
    sumWiY <- sum(WiY, na.rm = TRUE)
    sumWi <- sum(Wi, na.rm = TRUE)
    Xbar <- sumWiX / sumWi
    Ybar <- sumWiY / sumWi
    Ui <- X - Xbar
    Vi <- Y - Ybar

    Bi <- Wi * ((Ui / Yw) + (b * Vi / Xw) - (b * Ui + Vi) * Ri / alpha.i)
    wTOPint <- Bi * Wi * Vi
    wBOTint <- Bi * Wi * Ui
    sumTOP <- sum(wTOPint, na.rm = TRUE)
    sumBOT <- sum(wBOTint, na.rm = TRUE)
    b <- sumTOP / sumBOT

    # zero or problematic data
    if (anyNA(b, b.old)) {
      return(dplyr::tibble(
        Intercept = NA,
        Slope = NA,
        Int_error = NA,
        Slope_error = NA,
        OLS_slope = NA
      ))
    }

    b.diff <- abs(b - b.old)
  }

  a <- Ybar - b * Xbar
  wYorkFitCoefs <- c(a, b)

  # ERROR CALCULATION #

  Xadj <- Xbar + Bi
  WiXadj <- Wi * Xadj
  sumWiXadj <- sum(WiXadj, na.rm = TRUE)
  Xadjbar <- sumWiXadj / sumWi
  Uadj <- Xadj - Xadjbar
  wErrorTerm <- Wi * Uadj * Uadj
  errorSum <- sum(wErrorTerm, na.rm = TRUE)
  b.err <- sqrt(1 / errorSum)
  a.err <- sqrt((1 / sumWi) + (Xadjbar^2) * (b.err^2))
  wYorkFitErrors <- c(a.err, b.err)

  # GOODNESS OF FIT CALCULATION #
  lgth <- length(X)
  wSint <- Wi * (Y - b * X - a)^2
  sumSint <- sum(wSint, na.rm = TRUE)
  wYorkGOF <- c(sumSint / (lgth - 2), sqrt(2 / (lgth - 2))) # GOF (should equal 1 if assumptions are valid), #standard error in GOF

  ans <- dplyr::tibble(
    Intercept = a,
    Slope = b,
    Int_error = a.err,
    Slope_error = b.err,
    OLS_slope = b0
  )

  return(ans)
}


# Bi-linear interpolation on a regular grid.
# Allows surface predictions at a low resolution to be interpolated rather than
# using a high-k GAM directly.
interp_surface <- function(obj, loc) {
  # obj is a surface or image object like the list for contour, persp or image.
  # loc a matrix of 2-d locations -- new points to evaluate the surface.
  x <- obj$x
  y <- obj$y
  z <- obj$z
  nx <- length(x)
  ny <- length(y)
  # this clever idea for finding the intermediate coordinates at the new points
  # is from J-O Irisson
  lx <- stats::approx(x, 1:nx, loc[, 1])$y
  ly <- stats::approx(y, 1:ny, loc[, 2])$y
  lx1 <- floor(lx)
  ly1 <- floor(ly)
  # x and y distances between each new point and the closest grid point in the lower left hand corner.
  ex <- lx - lx1
  ey <- ly - ly1
  # fix up weights to handle the case when loc are equal to
  # last grid point.  These have been set to NA above.
  ex[lx1 == nx] <- 1
  ey[ly1 == ny] <- 1
  lx1[lx1 == nx] <- nx - 1
  ly1[ly1 == ny] <- ny - 1
  # bilinear interpolation finds simple weights based on the
  # the four corners of the grid box containing the new points.
  return(
    z[cbind(lx1, ly1)] *
      (1 - ex) *
      (1 - ey) +
      z[cbind(lx1 + 1, ly1)] * ex * (1 - ey) +
      z[cbind(lx1, ly1 + 1)] * (1 - ex) * ey +
      z[cbind(lx1 + 1, ly1 + 1)] * ex * ey
  )
}


# Bilinear interpolation from a coarse GAM prediction grid to a fine output grid
interp_grid <- function(input.data, x = "u", y = "v", z, n = 201) {
  # current number of points
  int <- length(unique(input.data[[x]]))

  obj <- list(
    x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = int),
    y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = int),
    z = matrix(z, nrow = int)
  )

  loc <- expand.grid(
    x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
    y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
  )

  res.interp <- interp_surface(obj, loc)

  out <- expand.grid(
    u = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
    v = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
  )

  data.frame(out, z = res.interp)
}

Try the openair package in your browser

Any scripts or data that you put into this service are public.

openair documentation built on April 2, 2026, 9:07 a.m.