#' Function for plotting bivariate polar plots with smoothing.
#'
#' Function for plotting pollutant concentration in polar coordinates showing
#' concentration by wind speed (or another numeric variable) and direction. Mean
#' concentrations are calculated for wind speed-direction \sQuote{bins} (e.g.
#' 0-1, 1-2 m/s,... and 0-10, 10-20 degrees etc.). To aid interpretation,
#' `gam` smoothing is carried out using `mgcv`.
#'
#' The bivariate polar plot is a useful diagnostic tool for quickly gaining an
#' idea of potential sources. Wind speed is one of the most useful variables to
#' use to separate source types (see references). For example, ground-level
#' concentrations resulting from buoyant plumes from chimney stacks tend to peak
#' under higher wind speed conditions. Conversely, ground-level, non-buoyant
#' plumes such as from road traffic, tend to have highest concentrations under
#' low wind speed conditions. Other sources such as from aircraft engines also
#' show differing characteristics by wind speed.
#'
#' The function has been developed to allow variables other than wind speed to
#' be plotted with wind direction in polar coordinates. The key issue is that
#' the other variable plotted against wind direction should be discriminating in
#' some way. For example, temperature can help reveal high-level sources brought
#' down to ground level in unstable atmospheric conditions, or show the effect
#' of a source emission that depends on temperature, e.g. biogenic isoprene.
#'
#' The plots can vary considerably depending on how much smoothing is done. The
#' approach adopted here is based on the very flexible and capable `mgcv`
#' package that uses *Generalized Additive Models*. While methods do exist to
#' find an optimum level of smoothness, they are not necessarily useful. The
#' principal aim of `polarPlot` is as a graphical analysis rather than for
#' quantitative purposes. In this respect the smoothing aims to strike a balance
#' between revealing interesting (real) features and overly noisy data. The
#' defaults used in [polarPlot()] are based on the analysis of data from many
#' different sources. More advanced users may wish to modify the code and adopt
#' other smoothing approaches.
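#'
#' For example, the degree of smoothing can be explored by varying `k` (a
#' sketch using the bundled `mydata` example data set; lower values of `k`
#' increase smoothing):
#'
#' ```r
#' polarPlot(mydata, pollutant = "nox", k = 100) # the default
#' polarPlot(mydata, pollutant = "nox", k = 50)  # smoother surface
#' ```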
#'
#' Various statistics can be considered, e.g. mean, maximum or median.
#' `statistic = "max"` is often useful for revealing sources. Pair-wise
#' statistics between two pollutants can also be calculated.
#'
#' The function can also be used to compare two pollutant species through a
#' range of pair-wise statistics (see the help on `statistic` and Grange et
#' al., 2016; open-access publication link below).
#'
#' Wind direction is split into 10 degree intervals and the other variable
#' (e.g. wind speed) into 30 intervals. These 2D bins are then used to
#' calculate the statistics.
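#'
#' As an illustrative sketch (not the internal `openair` code), the binning
#' amounts to something like the following, assuming numeric vectors `wd`, `ws`
#' and `conc`:
#'
#' ```r
#' wd_bin <- cut(wd, breaks = seq(0, 360, by = 10))
#' ws_bin <- cut(ws, breaks = 30)
#' bin_means <- tapply(conc, list(ws_bin, wd_bin), mean, na.rm = TRUE)
#' ```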
#'
#' These plots often show interesting features at higher wind speeds (see
#' references below). For these conditions there can be very few measurements
#' and therefore greater uncertainty in the calculation of the surface. There
#' are several ways in which this issue can be tackled. First, it is possible to
#' avoid smoothing altogether and use [polarFreq()]. Second, the effect of
#' setting a minimum number of measurements in each wind speed-direction bin can
#' be examined through `min.bin`. It is possible that a single point at high
#' wind speed conditions can strongly affect the surface prediction. Therefore,
#' setting `min.bin = 3`, for example, will remove all wind speed-direction bins
#' with fewer than 3 measurements *before* fitting the surface. Third, consider
#' setting `uncertainty = TRUE`. This option will show the predicted surface
#' together with upper and lower 95% confidence intervals, which take account of
#' the frequency of measurements.
#'
#' @inheritParams shared_openair_params
#'
#' @param mydata A data frame minimally containing `wd`, another variable to
#' plot in polar coordinates (the default is a column \dQuote{ws} --- wind
#' speed) and a pollutant. Should also contain `date` if plots by time period
#' are required.
#'
#' @param pollutant Mandatory. A pollutant name corresponding to a variable in a
#' data frame should be supplied e.g. `pollutant = "nox"`. There can also be
#' more than one pollutant specified e.g. `pollutant = c("nox", "no2")`. The
#' main use of using two or more pollutants is for model evaluation where two
#' species would be expected to have similar concentrations. This saves the
#' user stacking the data and it is possible to work with columns of data
#' directly. A typical use would be `pollutant = c("obs", "mod")` to compare
#' two columns \dQuote{obs} (the observations) and \dQuote{mod} (modelled
#' values). When pair-wise statistics such as Pearson correlation and
#' regression techniques are to be plotted, `pollutant` takes two elements
#' too. For example, `pollutant = c("bc", "pm25")` where `"bc"` is a function
#' of `"pm25"`.
#'
#' @param x Name of variable to plot against wind direction in polar
#' coordinates, the default is wind speed, \dQuote{ws}.
#'
#' @param wd Name of wind direction field.
#'
#' @param statistic The statistic that should be applied to each wind
#' speed/direction bin. Because of the smoothing involved, the colour scale
#' for some of these statistics is only to provide an indication of overall
#' pattern and should not be interpreted in concentration units e.g. for
#' `statistic = "weighted.mean"` where the bin mean is multiplied by the bin
#' frequency and divided by the total frequency. In many cases using
#' `polarFreq` will be better. Setting `statistic = "weighted.mean"` can be
#' useful because it provides an indication of the concentration * frequency
#' of occurrence and will highlight the wind speed/direction conditions that
#' dominate the overall mean. Can be:
#'
#' \itemize{ \item \dQuote{mean} (default), \dQuote{median}, \dQuote{max}
#' (maximum), \dQuote{frequency}, \dQuote{stdev} (standard deviation),
#' \dQuote{weighted.mean}.
#'
#' \item `statistic = "nwr"` Implements the Non-parametric Wind
#' Regression approach of Henry et al. (2009) that uses kernel smoothers. The
#' `openair` implementation is not identical because Gaussian kernels are
#' used for both wind direction and speed. The smoothing is controlled by
#' `ws_spread` and `wd_spread`.
#'
#' \item `statistic = "cpf"` the conditional probability function (CPF)
#' is plotted and a single (usually high) percentile level is supplied. The
#' CPF is defined as CPF = my/ny, where my is the number of samples in the y
#' bin (by default a wind direction, wind speed interval) with mixing ratios
#' greater than the *overall* percentile concentration, and ny is the
#' total number of samples in the same wind sector (see Ashbaugh et al.,
#' 1985). Note that percentile intervals can also be considered; see
#' `percentile` for details.
#'
#' \item When `statistic = "r"` or `statistic = "Pearson"`, the
#' Pearson correlation coefficient is calculated for *two* pollutants.
#' The calculation involves a weighted Pearson correlation coefficient, which
#' is weighted by Gaussian kernels for wind direction and the radial variable
#' (by default wind speed). More weight is assigned to values close to a wind
#' speed-direction interval. Kernel weighting is used to ensure that all data
#' are used rather than relying on the potentially small number of values in a
#' wind speed-direction interval.
#'
#' \item When `statistic = "Spearman"`, the Spearman correlation
#' coefficient is calculated for *two* pollutants. The calculation
#' involves a weighted Spearman correlation coefficient, which is weighted by
#' Gaussian kernels for wind direction and the radial variable (by default wind
#' speed). More weight is assigned to values close to a wind speed-direction
#' interval. Kernel weighting is used to ensure that all data are used rather
#' than relying on the potentially small number of values in a wind
#' speed-direction interval.
#'
#' \item `"robust_slope"` is another option for pair-wise statistics, as is
#' `"quantile.slope"`, which uses quantile regression to estimate the slope
#' for a particular quantile level (see also `tau` for setting the quantile
#' level).
#'
#' \item `"york_slope"` is another option for pair-wise statistics which
#' uses the *York regression method* to estimate the slope. In this
#' method the uncertainties in `x` and `y` are used in the
#' determination of the slope. The uncertainties are provided by
#' `x_error` and `y_error` --- see below.}
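#'
#' The Gaussian kernel weighting used for `"nwr"` and the pair-wise statistics
#' can be sketched as follows (illustrative only; `kernel_weight` is not part
#' of `openair`):
#'
#' ```r
#' # weight of an observation (ws, wd) relative to a target point (ws0, wd0)
#' kernel_weight <- function(ws, wd, ws0, wd0, ws_spread = 1.5, wd_spread = 5) {
#'   d_wd <- pmin(abs(wd - wd0), 360 - abs(wd - wd0)) # circular difference
#'   exp(-0.5 * ((ws - ws0) / ws_spread)^2) * exp(-0.5 * (d_wd / wd_spread)^2)
#' }
#' ```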
#'
#' @param limits The function does its best to choose sensible limits
#' automatically. However, there are circumstances when the user will wish to
#' set different ones. An example would be a series of plots showing each year
#' of data separately. The limits are set in the form `c(lower, upper)`, so
#' `limits = c(0, 100)` would force the plot limits to span 0-100.
#'
#' @param exclude.missing Setting this option to `TRUE` (the default) removes
#' points from the plot that are too far from the original data. The smoothing
#' routines will produce predictions at points where no data exist, i.e. they
#' extrapolate. Removing the points that are too far from the original data
#' produces a plot where it is clear where the original data lie. If set to
#' `FALSE` missing data will be interpolated.
#'
#' @param uncertainty Should the uncertainty in the calculated surface be shown?
#' If `TRUE` three plots are produced on the same scale showing the predicted
#' surface together with the estimated lower and upper uncertainties at the
#' 95% confidence interval. Calculating the uncertainties is useful to
#' understand whether features are real or not. For example, at high wind
#' speeds where there are few data there is greater uncertainty over the
#' predicted values. The uncertainties are calculated using the GAM and
#' weighting is done by the frequency of measurements in each wind
#' speed-direction bin. Note that if uncertainties are calculated then the
#' type is set to "default".
#'
#' @param percentile If `statistic = "percentile"` then `percentile` is used,
#' expressed from 0 to 100. Note that the percentile value is calculated in
#' the wind speed, wind direction \sQuote{bins}. For this reason it can also
#' be useful to set `min.bin` to ensure there are a sufficient number of
#' points available to estimate a percentile. See `quantile` for more details
#' of how percentiles are calculated.
#'
#' `percentile` is also used for the Conditional Probability Function (CPF)
#' plots. `percentile` can be of length two, in which case the percentile
#' *interval* is considered for use with CPF. For example, `percentile = c(90,
#' 100)` will plot the CPF for concentrations between the 90 and 100th
#' percentiles. Percentile intervals can be useful for identifying specific
#' sources. In addition, `percentile` can also be of length 3. The third value
#' is the \sQuote{trim} value to be applied. When calculating percentile
#' intervals, many intervals can cover very low values where there is no useful
#' information. The trim value ensures that values greater than or equal to
#' the trim * mean value are considered *before* the percentile intervals are
#' calculated. The effect is to extract more detail from many source
#' signatures. See the manual for examples. Finally, if the trim value is less
#' than zero the percentile range is interpreted as absolute concentration
#' values and subsetting is carried out directly.
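#'
#' For a single wind sector and threshold, the CPF calculation can be sketched
#' as (illustrative only; `conc` and `wd` are assumed numeric vectors):
#'
#' ```r
#' thresh <- stats::quantile(conc, probs = 0.95, na.rm = TRUE)
#' in_sector <- wd >= 180 & wd < 190
#' cpf <- sum(conc[in_sector] > thresh, na.rm = TRUE) / sum(in_sector)
#' ```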
#'
#' @param weights At the edges of the plot there may only be a few data points
#' in each wind speed-direction interval, which could in some situations
#' distort the plot if the concentrations are high. `weights` applies a
#' weighting to reduce their influence. For example, by default, if only a
#' single data point exists in a bin the weighting factor is 0.25, for two
#' points it is 0.5 and for three points it is 0.75. To not apply any weighting
#' and use the data as is, use `weights = c(1, 1, 1)`.
#'
#' As an alternative to down-weighting, these points can be removed altogether
#' using `min.bin`.
#'
#' @param min.bin The minimum number of points allowed in a wind speed/wind
#' direction bin. The default is 1. A value of 2 requires at least 2 valid
#' records in each bin and so on; bins with fewer than 2 valid records are set
#' to NA. Care should be taken when using a value > 1 because of the risk of
#' removing real data points. It is recommended to consider your data with
#' care. Also, the `polarFreq` function can be of use in such circumstances.
#'
#' @param mis.col When `min.bin` is > 1 it can be useful to show where data are
#' removed on the plots. This is done by shading the missing data in
#' `mis.col`. To not highlight missing data when `min.bin` > 1 choose `mis.col
#' = "transparent"`.
#'
#' @param upper This sets the upper limit wind speed to be used. Often there are
#' only relatively few data points at very high wind speeds and plotting all
#' of them can reduce the useful information in the plot.
#'
#' @param units The units shown on the polar axis scale.
#'
#' @param force.positive The default is `TRUE`. Sometimes if smoothing data with
#' steep gradients it is possible for predicted values to be negative.
#' `force.positive = TRUE` ensures that predictions remain positive. This is
#' useful for several reasons. First, with lots of missing data more
#' interpolation is needed and this can result in artefacts because the
#' predictions are too far from the original data. Second, if it is known
#' beforehand that the data are all positive, then this option carries that
#' assumption through to the prediction. The only likely case for setting
#' `force.positive = FALSE` would be if background concentrations were first
#' subtracted, resulting in data that are legitimately negative. For the vast
#' majority of situations it is expected that the user will not need to alter
#' the default option.
#'
#' @param k This is the smoothing parameter used by the `gam` function in
#' package `mgcv`. Typically, a value of around 100 (the default) seems to be
#' suitable and will resolve important features in the plot. The most
#' appropriate choice of `k` is problem-dependent; but extensive testing of
#' polar plots for many different problems suggests a value of `k` of about
#' 100 is suitable. Setting `k` to higher values will not tend to affect the
#' surface predictions by much but will add to the computation time. Lower
#' values of `k` will increase smoothing. Sometimes with few data to plot
#' `polarPlot` will fail. Under these circumstances it can be worth lowering
#' the value of `k`.
#'
#' @param normalise If `TRUE` concentrations are normalised by dividing by their
#' mean value. This is done *after* fitting the smooth surface. This option is
#' particularly useful if one is interested in the patterns of concentrations
#' for several pollutants on different scales e.g. NOx and CO. Often useful if
#' more than one `pollutant` is chosen.
#'
#' @param ws_spread The value of sigma used for Gaussian kernel weighting of
#' wind speed when `statistic = "nwr"` or when correlation and regression
#' statistics are used such as *r*. Default is `1.5`.
#'
#' @param wd_spread The value of sigma used for Gaussian kernel weighting of
#' wind direction when `statistic = "nwr"` or when correlation and regression
#' statistics are used such as *r*. Default is `5`.
#'
#' @param x_error The `x` error / uncertainty used when `statistic =
#' "york_slope"`.
#'
#' @param y_error The `y` error / uncertainty used when `statistic =
#' "york_slope"`.
#'
#' @param kernel Type of kernel used for the weighting procedure for when
#' correlation or regression techniques are used. Only `"gaussian"` is
#' supported but this may be enhanced in the future.
#'
#' @param formula.label When pair-wise statistics such as regression slopes are
#' calculated and plotted, should a formula label be displayed?
#' `formula.label` will also determine whether concentration information is
#' printed when `statistic = "cpf"`.
#'
#' @param tau The quantile to be estimated when `statistic` is set to
#' `"quantile.slope"`. Default is `0.5` which is equal to the median and will
#' be ignored if `"quantile.slope"` is not used.
#'
#' @param ... Additional options are passed on to [cutData()] for `type` handling.
#' Some additional arguments are also available:
#' - `xlab`, `ylab` and `main` override the x-axis label, y-axis label, and plot title.
#' - `layout` sets the layout of facets - e.g., `layout = c(2, 5)` will result in 2 columns and 5 rows.
#' - `fontsize` overrides the overall font size of the plot.
#' - `annotate = FALSE` will not plot the N/E/S/W labels.
#'
#' @return an [openair][openair-package] object. `data` contains four set
#' columns: `cond`, conditioning based on `type`; `u` and `v`, the
#' translational vectors based on `ws` and `wd`; and the local `pollutant`
#' estimate.
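#'
#' For example, the underlying data can be retrieved without plotting (a
#' sketch):
#'
#' ```r
#' out <- polarPlot(mydata, pollutant = "nox", plot = FALSE)
#' head(out$data)                 # u, v and the pollutant estimate
#' attr(out$data, "radial_scale") # radial axis range
#' ```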
#' @author David Carslaw
#' @family polar directional analysis functions
#' @references
#'
#' Ashbaugh, L.L., Malm, W.C., Sadeh, W.Z., 1985. A residence time probability
#' analysis of sulfur concentrations at Grand Canyon National Park. Atmospheric
#' Environment 19 (8), 1263-1270.
#'
#' Carslaw, D.C., Beevers, S.D, Ropkins, K and M.C. Bell (2006). Detecting and
#' quantifying aircraft and other on-airport contributions to ambient nitrogen
#' oxides in the vicinity of a large international airport. Atmospheric
#' Environment. 40/28 pp 5424-5434.
#'
#' Carslaw, D.C., & Beevers, S.D. (2013). Characterising and understanding
#' emission sources using bivariate polar plots and k-means clustering.
#' Environmental Modelling & Software, 40, 325-329. DOI:
#' 10.1016/j.envsoft.2012.09.005.
#'
#' Grange, S. K., Carslaw, D. C., & Lewis, A. C. (2016). Source apportionment
#' advances with bivariate polar plots, correlation, and regression techniques.
#' Atmospheric Environment. 145, 128-134. DOI: 10.1016/j.atmosenv.2016.09.016.
#'
#' Henry, R.C., Chang, Y.S., Spiegelman, C.H., 2002. Locating nearby sources of
#' air pollution by nonparametric regression of atmospheric concentrations on
#' wind direction. Atmospheric Environment 36 (13), 2237-2244.
#'
#' Henry, R., Norris, G.A., Vedantham, R., Turner, J.R., 2009. Source region
#' identification using kernel smoothing. Environ. Sci. Technol. 43 (11),
#' 4090-4097. DOI: 10.1021/es8011723.
#'
#' Uria-Tellaetxe, I. and D.C. Carslaw (2014). Source identification using a
#' conditional bivariate probability function. Environmental Modelling &
#' Software, Vol. 59, 1-9.
#'
#' Westmoreland, E.J., N. Carslaw, D.C. Carslaw, A. Gillah and E. Bates (2007).
#' Analysis of air quality within a street canyon using statistical and
#' dispersion modelling techniques. Atmospheric Environment. Vol. 41(39), pp.
#' 9195-9205.
#'
#' Yu, K.N., Cheung, Y.P., Cheung, T., Henry, R.C., 2004. Identifying the impact
#' of large urban airports on local air quality by nonparametric regression.
#' Atmospheric Environment 38 (27), 4501-4507.
#' @examples
#'
#' # basic plot
#' polarPlot(mydata, pollutant = "nox")
#' \dontrun{
#'
#' # polarPlots by year on same scale
#' polarPlot(mydata, pollutant = "so2", type = "year", main = "polarPlot of so2")
#'
#' # set minimum number of bins to be used to see if pattern remains similar
#' polarPlot(mydata, pollutant = "nox", min.bin = 3)
#'
#' # plot by day of the week
#' polarPlot(mydata, pollutant = "pm10", type = "weekday")
#'
#' # show the 95% confidence intervals in the surface fitting
#' polarPlot(mydata, pollutant = "so2", uncertainty = TRUE)
#'
#'
#' # Pair-wise statistics
#' # Pearson correlation
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "r")
#'
#' # Robust regression slope, takes a bit of time
#' polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "robust.slope")
#'
#' # Least squares regression works too but it is not recommended, use robust
#' # regression
#' # polarPlot(mydata, pollutant = c("pm25", "pm10"), statistic = "slope")
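#'
#' # A CPF plot at a single percentile (sketch using the bundled 'mydata')
#' polarPlot(mydata, pollutant = "so2", statistic = "cpf", percentile = 95)
#'
#' # Non-parametric wind regression with adjusted kernel spreads
#' polarPlot(mydata,
#'   pollutant = "nox", statistic = "nwr",
#'   ws_spread = 1, wd_spread = 4
#' )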
#' }
#'
#' @export
polarPlot <-
function(
mydata,
pollutant = "nox",
x = "ws",
wd = "wd",
type = "default",
statistic = "mean",
limits = NULL,
exclude.missing = TRUE,
uncertainty = FALSE,
percentile = NA,
cols = "default",
weights = c(0.25, 0.5, 0.75),
min.bin = 1,
mis.col = "grey",
upper = NA,
angle.scale = 315,
units = x,
force.positive = TRUE,
k = 100,
normalise = FALSE,
key.title = paste(statistic, pollutant, sep = " "),
key.position = "right",
strip.position = "top",
auto.text = TRUE,
ws_spread = 1.5,
wd_spread = 5,
x_error = NA,
y_error = NA,
kernel = "gaussian",
formula.label = TRUE,
tau = 0.5,
plot = TRUE,
key = NULL,
...
) {
if (statistic == "percentile" && is.na(percentile[1])) {
warning("percentile value missing, using 50")
percentile <- 50
}
if (statistic == "cpf" && is.na(percentile[1])) {
warning("percentile value missing, using 75")
percentile <- 75
}
# Allow for period notation
statistic <- gsub("\\.| ", "_", statistic)
# Build vector for many checks
correlation_stats <- c(
"r",
"slope",
"intercept",
"robust_slope",
"robust_intercept",
"quantile_slope",
"quantile_intercept",
"Pearson",
"Spearman",
"york_slope"
)
if (statistic %in% correlation_stats && length(pollutant) != 2) {
stop("Correlation statistic requires two pollutants.")
}
if (length(type) > 2) {
stop("Maximum number of types is 2.")
}
## can't have conditioning here
if (uncertainty) {
type <- "default"
}
if (uncertainty && length(pollutant) > 1) {
stop("Can only have one pollutant when uncertainty = TRUE")
}
if (
!statistic %in%
c(
"mean",
"median",
"frequency",
"max",
"stdev",
"weighted_mean",
"percentile",
"cpf",
"nwr",
correlation_stats
)
) {
stop("statistic '", statistic, "' not recognised.")
}
if (length(weights) != 3) {
stop("weights should be of length 3.")
}
# check key.position
key.position <- check_key_position(key.position, key)
## extra.args setup
extra.args <- list(...)
## label controls
extra.args$xlab <- if ("xlab" %in% names(extra.args)) {
quickText(extra.args$xlab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$ylab <- if ("ylab" %in% names(extra.args)) {
quickText(extra.args$ylab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$main <- if ("main" %in% names(extra.args)) {
quickText(extra.args$main, auto.text)
} else {
quickText("", auto.text)
}
# if clustering, return lower resolution plot
extra.args$cluster <- if ("cluster" %in% names(extra.args)) {
TRUE
} else {
FALSE
}
## need to initialise key_header & key_footer for later combination
key_header <- statistic
key_footer <- paste(pollutant, collapse = ", ")
## layout default
if (!"layout" %in% names(extra.args)) {
extra.args$layout <- NULL
}
## extract variables of interest
vars <- c(wd, x, pollutant)
if (statistic == "york_slope") {
vars <- c(vars, x_error, y_error)
}
if (any(type %in% dateTypes)) {
vars <- c(vars, "date")
}
mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
mydata <- stats::na.omit(mydata)
# check to see if a high proportion of the data is negative
if (
length(which(mydata[pollutant] < 0)) / nrow(mydata) > 0.1 &&
force.positive
) {
warning(">10% negative data detected, set force.positive = FALSE?")
}
# scale data by subtracting the min value this helps with dealing
# with negative data on radial axis (starts from zero, always
# positive)
# return radial scale as attribute "radial_scale" on returned data for external plotting
radial_scale <- range(mydata[[x]])
mydata[[x]] <- mydata[[x]] - min(mydata[[x]], na.rm = TRUE)
# if more than one pollutant, need to stack the data and set type =
# "variable". This case is most relevant for model-measurement
# comparisons where data are in columns. Can also do more than one
# pollutant and a single type that is not "default", in which case
# pollutant becomes a conditioning variable
# if statistic is 'r' we need the data for two pollutants in columns
if (length(pollutant) > 1 && !statistic %in% correlation_stats) {
if (length(type) > 1) {
cli::cli_warn("Only {.arg type} = '{type[1]}' will be used.")
type <- type[1]
}
## use pollutants as conditioning variables
mydata <- tidyr::gather(
mydata,
key = "variable",
value = "value",
dplyr::all_of(pollutant),
factor_key = TRUE
)
## now set pollutant to "value"
pollutant <- "value"
if (type == "default") {
type <- "variable"
} else {
type <- c(type, "variable")
}
}
## cutData depending on type
mydata <- cutData(mydata, type, ...)
## if upper ws not set, set it to the max to display all information
max.ws <- max(mydata[[x]], na.rm = TRUE)
min.ws <- min(mydata[[x]], na.rm = TRUE)
clip <- TRUE ## used for removing data where ws > upper
if (is.na(upper)) {
upper <- max.ws
clip <- FALSE
}
# resolution deprecated, int is resolution of GAM surface predictions over int * int grid
# 51 works well with bilinear interpolation of results
int <- 51
## binning wd data properly
## use 10 degree binning of wd if already binned, else 5
if (all(mydata[[wd]] %% 10 == 0, na.rm = TRUE)) {
wd.int <- 10
} else {
if (toupper(statistic) == "NWR") wd.int <- 2 else wd.int <- 5 ## how to split wd
}
if (toupper(statistic) == "NWR") {
ws_bins <- 40
} else {
ws_bins <- 30
}
if (statistic == "nwr") {
k <- 200
} # limit any smoothing
ws.seq <- seq(min.ws, max.ws, length = ws_bins)
wd.seq <- seq(from = wd.int, to = 360, by = wd.int) ## wind directions from wd.int to 360
ws.wd <- expand.grid(x = ws.seq, wd = wd.seq)
u <- with(ws.wd, x * sin(pi * wd / 180)) ## convert to polar coords
v <- with(ws.wd, x * cos(pi * wd / 180))
## data to predict over
input.data <- expand.grid(
u = seq(-upper, upper, length = int),
v = seq(-upper, upper, length = int)
)
## Pval is only set in the CPF branch; initialise to NULL so it always exists
Pval <- NULL
if (statistic == "cpf") {
## can be interval of percentiles or a single (threshold)
if (length(percentile) > 1) {
statistic <- "cpfi" # CPF interval
if (length(percentile) == 3) {
## in this case there is a trim value as a proportion of the mean
## if this value <0 use absolute values as range
Mean <- mean(mydata[[pollutant]], na.rm = TRUE)
if (percentile[3] < 0) {
Pval <- percentile[1:2]
} else {
Pval <- stats::quantile(
subset(
mydata[[pollutant]],
mydata[[pollutant]] >=
Mean *
percentile[3]
),
probs = percentile[1:2] / 100,
na.rm = TRUE
)
}
} else {
Pval <- stats::quantile(
mydata[[pollutant]],
probs = percentile / 100,
na.rm = TRUE
)
}
sub <- paste0(
"CPF (",
format(Pval[1], digits = 2),
" to ",
format(Pval[2], digits = 2),
")"
)
} else {
Pval <- stats::quantile(
mydata[[pollutant]],
probs = percentile / 100,
na.rm = TRUE
)
sub <- paste0(
"CPF at the ",
percentile,
"th percentile (=",
format(Pval, digits = 2),
")"
)
if (!formula.label) sub <- NULL # no label
}
} else {
sub <- NULL
}
## Thin wrapper — delegates to polar_prepare_grid() with all dependencies
## passed explicitly rather than captured from the enclosing scope.
prepare.grid <- function(mydata) {
polar_prepare_grid(
mydata = mydata,
wd_col = wd,
x_col = x,
pollutant = pollutant,
statistic = statistic,
correlation_stats = correlation_stats,
Pval = Pval,
percentile = percentile,
ws.wd = ws.wd,
u = u,
v = v,
input.data = input.data,
ws_bins = ws_bins,
wd.int = wd.int,
max.ws = max.ws,
weights = weights,
min.bin = min.bin,
force.positive = force.positive,
uncertainty = uncertainty,
exclude.missing = exclude.missing,
upper = upper,
k = k,
ws_spread = ws_spread,
wd_spread = wd_spread,
kernel = kernel,
tau = tau,
x_error = x_error,
y_error = y_error,
cluster = extra.args$cluster
)
}
## if min.bin >1 show the missing data. Work this out by running twice:
## first time with no missings, second with min.bin.
## Plot one surface on top of the other.
if (!missing(min.bin)) {
tmp <- min.bin
min.bin <- 0
res1 <-
map_type(
mydata,
type,
prepare.grid,
.include_default = TRUE
)
min.bin <- tmp
res <-
map_type(
mydata,
type,
prepare.grid,
.include_default = TRUE
)
res$miss <- res1$z
} else {
res <- map_type(
mydata,
type,
prepare.grid,
.include_default = TRUE
)
}
## with CPF make sure not >1 due to surface fitting
if (any(res$z > 1, na.rm = TRUE) && statistic %in% c("cpf", "cpfi")) {
id <- which(res$z > 1)
res$z[id] <- 1
}
## remove wind speeds > upper to make a circle
if (clip) {
res$z[(res$u^2 + res$v^2)^0.5 > upper] <- NA
}
## normalise by dividing by the mean conditioning value if needed
if (normalise) {
res <- dplyr::mutate(
res,
z = .data$z / mean(.data$z, na.rm = TRUE),
.by = dplyr::all_of(type)
)
key_footer <- "normalised level"
}
# correlation notation
if (statistic %in% c("r", "Pearson", "Spearman")) {
key_footer <- paste0("corr(", pollutant[1], ", ", pollutant[2], ")")
# make sure smoothing does not result in r > 1 or < -1
# sometimes happens with little data at edges
id <- which(res$z > 1)
if (length(id) > 0) {
res$z[id] <- 1
}
id <- which(res$z < -1)
if (length(id) > 0) res$z[id] <- -1
}
# if regression, set key_footer to 'm' (slope)
if (grepl("slope|intercept", statistic) && length(pollutant) == 2) {
key_footer <- "m"
}
# Labels for correlation and regression, keep lower case like other labels
if (missing(key.title)) {
if (key_header == "nwr") {
key_header <- "NWR"
}
if (key_header == "weighted_mean") {
key_header <- "weighted mean"
}
if (key_header == "percentile") {
key_header <- c(paste0(percentile, "th"), "percentile")
}
if ("cpf" %in% key_header) {
key_header <- c("CPF probability")
}
if (statistic %in% c("r", "Pearson")) {
key_header <- ("Pearson correlation")
}
if (statistic == "Spearman") {
key_header <- "Spearman correlation"
}
if (statistic == "robust_slope") {
key_header <- "robust slope"
}
if (statistic == "robust_intercept") {
key_header <- "robust intercept"
}
if (statistic == "york_slope") {
key_header <- "York regression slope"
}
if (statistic == "quantile_slope") {
key_header <- paste0("quantile slope (tau: ", tau, ")")
}
if (statistic == "quantile_intercept") {
key_header <- paste0("quantile intercept (tau: ", tau, ")")
}
}
## special handling of layout for uncertainty
if (uncertainty && is.null(extra.args$layout)) {
extra.args$layout <- c(3, 1)
}
# if uncertainty = TRUE, change type for plotting (3 panels)
if (uncertainty) {
type <- "uncertainty"
}
# Format legend title
if (missing(key.title)) {
key.title <- paste(
key_header,
key_footer,
sep = ifelse(key.position %in% c("top", "bottom"), " ", "\n")
)
}
# check if key.header / key.footer are being used
key.title <- check_key_header(key.title, extra.args)
# run through
legend_title <- quickText(key.title, auto.text)
# restore to polar coordinates
plot_data <-
res |>
dplyr::mutate(
x = sqrt(.data$u^2 + .data$v^2),
wd = 270 - atan2(.data$v, .data$u) * (180 / pi),
wd = (wd + 180) %% 360
)
# if using the min.bin arg, just drop missing z that's also missing "miss".
# otherwise drop all of missing z
if ("miss" %in% names(plot_data)) {
plot_data <- dplyr::filter(
plot_data,
!(is.na(.data$z) & is.na(.data$miss))
) |>
dplyr::select(!"miss")
} else {
plot_data <- tidyr::drop_na(plot_data, "z")
}
# restore scale pre-normalisation
plot_data <-
dplyr::mutate(
plot_data,
x = scales::rescale(
.data$x,
to = radial_scale,
from = c(0, diff(radial_scale))
)
)
# Normalise limits: must be NULL or a numeric vector of length 2
if (!is.null(limits) && !(is.numeric(limits) && length(limits) == 2)) {
limits <- NULL
}
thePlot <-
plot_data |>
dplyr::arrange(!is.na(.data$z), .data$z) |>
ggplot2::ggplot(ggplot2::aes(x = .data$wd, y = .data$x)) +
ggplot2::geom_point(ggplot2::aes(colour = .data$z)) +
ggplot2::ggproto(
NULL,
ggplot2::coord_radial(r.axis.inside = angle.scale),
inner_radius = c(0, 1) * 0.475
) +
scale_x_compass() +
ggplot2::scale_y_continuous(
limits = range(pretty(plot_data$x, 20)),
expand = ggplot2::expansion()
) +
ggplot2::scale_color_gradientn(
colours = openColours(cols),
oob = scales::oob_squish,
na.value = mis.col,
limits = limits
) +
theme_openair_radial(key.position, panel.ontop = TRUE) +
set_extra_fontsize(extra.args) +
annotate_compass_points(
size = ifelse(
extra.args$annotate %||% TRUE,
if (is.null(extra.args$fontsize)) 3 else extra.args$fontsize / 3,
0
)
) +
ggplot2::labs(
color = legend_title,
x = extra.args$xlab,
y = extra.args$ylab,
title = extra.args$main,
subtitle = sub,
caption = if (
formula.label &&
grepl("slope|intercept", statistic) &&
length(pollutant) == 2
) {
quickText(
paste0("Formula: ", pollutant[1], " = m.", pollutant[2], " + c"),
auto.text
)
} else {
NULL
}
) +
{
if (key.position %in% c("left", "right")) {
ggplot2::theme(
legend.key.height = ggplot2::unit(1, "null"),
legend.key.spacing.y = ggplot2::unit(0, "cm")
)
}
} +
{
if (key.position %in% c("top", "bottom")) {
ggplot2::theme(
legend.key.width = ggplot2::unit(1, "null"),
legend.key.spacing.x = ggplot2::unit(0, "cm")
)
}
} +
get_facet(
type,
extra.args = extra.args,
scales = "fixed",
auto.text = auto.text,
strip.position = strip.position
) +
ggplot2::guides(
color = ggplot2::guide_colorbar(
theme = ggplot2::theme(
legend.title.position = ifelse(
key.position %in% c("left", "right"),
"top",
key.position
),
legend.text.position = key.position
)
)
)
if (plot) {
plot(thePlot)
}
newdata <- plot_data
# record the radial scale as an attribute - useful for variables that can be
# negative, such as air_temp
attr(newdata, "radial_scale") <- range(radial_scale)
output <- list(plot = thePlot, data = newdata, call = match.call())
class(output) <- "openair"
# Final return
invisible(output)
}
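The u/v to wind-direction back-transform used above when restoring polar coordinates can be checked in isolation. A minimal standalone sketch, assuming u and v were constructed as ws * sin(wd * pi / 180) and ws * cos(wd * pi / 180) (the usual openair convention):

```r
# Standalone sketch of the u/v -> wind-direction back-transform, assuming
# u = ws * sin(wd * pi / 180) and v = ws * cos(wd * pi / 180):
u <- c(0, 1, 0, -1) # unit wind vectors for wd = 0, 90, 180, 270
v <- c(1, 0, -1, 0)
wd <- (270 - atan2(v, u) * (180 / pi) + 180) %% 360
wd # 0 90 180 270
```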
## Top-level surface-fitting worker for polarPlot().
##
## In the original code this was a nested closure (prepare.grid) that implicitly
## captured ~25 variables from the enclosing polarPlot() scope. Lifting it to a
## top-level function makes every dependency explicit, improves readability, and
## allows independent testing.
##
## Parameters mirror the variables previously captured from polarPlot():
## wd_col / x_col — column names for wind direction and the radial variable
## (renamed to avoid shadowing the local cut() results)
polar_prepare_grid <- function(
mydata,
wd_col,
x_col,
pollutant,
statistic,
correlation_stats,
Pval = NULL,
percentile = NA,
ws.wd,
u,
v,
input.data,
ws_bins,
wd.int,
max.ws,
weights,
min.bin,
force.positive,
uncertainty,
exclude.missing,
upper,
k,
ws_spread,
wd_spread,
kernel,
tau,
x_error,
y_error,
cluster = FALSE
) {
## Identify which ws and wd bins the data belong to.
## Note: local variables 'wd' and 'x' shadow the column-name parameters.
wd <- cut(
wd.int * ceiling(mydata[[wd_col]] / wd.int - 0.5),
breaks = seq(0, 360, wd.int),
include.lowest = TRUE
)
x <- cut(
mydata[[x_col]],
breaks = seq(0, max.ws, length = ws_bins + 1),
include.lowest = TRUE
)
if (!statistic %in% c(correlation_stats, "nwr")) {
## Simple binned statistics
stat_fn <- switch(
statistic,
frequency = \(x) length(stats::na.omit(x)),
mean = \(x) mean(x, na.rm = TRUE),
median = \(x) stats::median(x, na.rm = TRUE),
max = \(x) max(x, na.rm = TRUE),
stdev = \(x) stats::sd(x, na.rm = TRUE),
cpf = \(x) length(which(x > Pval)) / length(x),
cpfi = \(x) length(which(x > Pval[1] & x <= Pval[2])) / length(x),
weighted_mean = \(x) mean(x) * length(x) / nrow(mydata),
percentile = \(x) {
stats::quantile(x, probs = percentile / 100, na.rm = TRUE)
}
)
binned <- mydata |>
dplyr::mutate(
wd_bin = cut(
wd.int * ceiling(.data[[wd_col]] / wd.int - 0.5),
breaks = seq(0, 360, wd.int),
include.lowest = TRUE
),
x_bin = cut(
.data[[x_col]],
breaks = seq(0, max.ws, length = ws_bins + 1),
include.lowest = TRUE
)
) |>
dplyr::summarise(
value = stat_fn(.data[[pollutant]]),
.by = c("wd_bin", "x_bin")
) |>
# expand.grid(x = ws.seq, wd = wd.seq) has x_bin varying fastest, wd_bin slowest
tidyr::complete(.data$wd_bin, .data$x_bin) |>
dplyr::arrange(.data$wd_bin, .data$x_bin) |>
dplyr::pull(.data$value) |>
unname()
} else if (toupper(statistic) == "NWR") {
## Non-parametric Wind Regression: fully vectorised over all grid points.
##
## For each grid point g and each observation d, compute the Gaussian kernel
## weight w[g,d] in one shot via outer(), then recover the weighted mean as
## a matrix–vector product. Avoids the per-row dplyr rowwise() overhead.
ws_diff <- outer(ws.wd$x, mydata[[x_col]], `-`) # n_grid × n_data
wd_diff <- outer(ws.wd$wd, mydata[[wd_col]], `-`) # n_grid × n_data
wd_diff[wd_diff < -180] <- wd_diff[wd_diff < -180] + 360
W <- gauss_dens(ws_diff, wd_diff, 0, 0, ws_spread, wd_spread)
binned <- drop(W %*% mydata[[pollutant]]) / rowSums(W)
} else {
## Kernel-weighted pair-wise statistics (correlation, regression, etc.)
binned <- mapply(
calculate_weighted_statistics,
ws1 = ws.wd$x,
wd1 = ws.wd$wd,
MoreArgs = list(
mydata = mydata,
statistic = statistic,
x = x_col,
y = wd_col,
pol_1 = pollutant[1],
pol_2 = pollutant[2],
ws_spread = ws_spread,
wd_spread = wd_spread,
kernel = kernel,
tau = tau,
x_error = x_error,
y_error = y_error
)
)
binned <- ifelse(is.infinite(binned), NA, binned)
}
## Bin frequencies — used for weighting and min.bin filtering
bin.len <- tapply(mydata[[pollutant[1]]], list(x, wd), length)
binned.len <- as.vector(bin.len)
## Apply edge weights (down-weight bins with very few observations)
W <- rep(1, times = length(binned))
for (i in 1:3) {
W[which(binned.len == i)] <- weights[i]
}
## Remove bins with fewer than min.bin observations
ids <- which(binned.len < min.bin)
binned[ids] <- NA
binned.len[ids] <- NA
## GAM power transform: sqrt for positive-only data, identity otherwise
n <- if (force.positive) 0.5 else 1
if (!uncertainty) {
## Standard surface fit (no confidence intervals)
Mgam <- try(mgcv::gam(binned^n ~ s(u, v, k = k), weights = W), TRUE)
if (!inherits(Mgam, "try-error")) {
pred <- as.vector(mgcv::predict.gam(Mgam, input.data))^(1 / n)
int <- if (cluster) 101 else 201
results <- interp_grid(input.data, z = pred, n = int)
} else {
results <- data.frame(u = u, v = v, z = binned)
exclude.missing <- FALSE
cli::cli_warn(
c(
"Not enough data to fit surface. Try:",
"i" = "Reducing the value of the smoothing parameter, k to less than {k}, or",
"i" = "Using {.arg statistic} = 'nwr'"
)
)
}
} else {
## Surface fit with 95% confidence intervals
Mgam <- mgcv::gam(binned^n ~ s(u, v, k = k), weights = binned.len)
pred_se <- mgcv::predict.gam(Mgam, input.data, se.fit = TRUE)
uncer <- 2 * as.vector(pred_se[[2]]) ## approx 95% CI half-width
## Unweighted central prediction
Mgam <- mgcv::gam(binned^n ~ s(u, v, k = k))
pred <- as.vector(mgcv::predict.gam(Mgam, input.data))
Lower <- (pred - uncer)^(1 / n)
Upper <- (pred + uncer)^(1 / n)
pred <- pred^(1 / n)
int <- 201
lower_uncer <- interp_grid(input.data, z = Lower, n = int) |>
dplyr::mutate(uncertainty = "lower uncertainty")
upper_uncer <- interp_grid(input.data, z = Upper, n = int) |>
dplyr::mutate(uncertainty = "upper uncertainty")
prediction <- interp_grid(input.data, z = pred, n = int) |>
dplyr::mutate(uncertainty = "prediction")
results <- dplyr::bind_rows(prediction, lower_uncer, upper_uncer)
}
## Remove predictions that are too far from the original data
exclude <- function(results) {
x <- seq(-upper, upper, length = int)
wsp <- rep(x, int)
wdp <- rep(x, rep(int, int))
all.data <- stats::na.omit(data.frame(u, v, binned.len))
ind <- with(all.data, mgcv::exclude.too.far(wsp, wdp, u, v, dist = 0.05))
results$z[ind] <- NA
results
}
if (exclude.missing) {
results <- exclude(results)
}
results
}
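The wind-direction binning inside `polar_prepare_grid()` first snaps each observation to the nearest bin centre with `ceiling(wd / wd.int - 0.5)`, then cuts on the bin edges. A small standalone sketch of that arithmetic, using `wd.int = 10`:

```r
# Sketch of the wind-direction binning used in polar_prepare_grid():
# observations are snapped to the nearest bin centre, then cut on the
# bin edges; note that wd near 360 lands in the last bin, not the first.
wd.int <- 10
wd <- c(5, 10, 15, 359, 360)
centre <- wd.int * ceiling(wd / wd.int - 0.5)
centre # 0 10 10 360 360
bins <- cut(centre, breaks = seq(0, 360, wd.int), include.lowest = TRUE)
```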
# Gaussian bivariate density function
gauss_dens <- function(x, y, mx, my, sx, sy) {
(1 / (2 * pi * sx * sy)) *
exp((-1 / 2) * ((x - mx)^2 / sx^2 + (y - my)^2 / sy^2))
}
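`gauss_dens()` is simply the product of two independent normal densities, so `stats::dnorm()` can reproduce it. This standalone sketch uses `dnorm()` directly to show the kernel weighting behaviour: after normalising by the maximum (as `calculate_weighted_statistics()` does), the grid centre gets weight 1 and weights decay with distance scaled by the spread parameters:

```r
# gauss_dens(x, y, 0, 0, sx, sy) equals dnorm(x, sd = sx) * dnorm(y, sd = sy);
# this sketch shows the kernel weighting: after normalising by the maximum,
# the centre point has weight 1 and weights fall off with distance.
ws_diff <- c(0, 1, 3) # distance from the grid centre in ws
wd_diff <- c(0, 20, 40) # distance from the grid centre in wd
ws_spread <- 1.5
wd_spread <- 10
w <- dnorm(ws_diff, sd = ws_spread) * dnorm(wd_diff, sd = wd_spread)
w <- w / max(w)
stopifnot(w[1] == 1, all(diff(w) < 0))
```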
## Kernel-weighted pair-wise statistics dispatcher.
##
## Takes the grid centre (ws1, wd1) directly — no more data[[1]] / across()
## indirection. Computes Gaussian kernel weights, filters low-weight rows, then
## delegates to the appropriate sub-function. Returns a scalar suitable for
## direct use with mapply().
calculate_weighted_statistics <- function(
ws1,
wd1,
mydata,
statistic,
x = "ws",
y = "wd",
pol_1,
pol_2,
ws_spread,
wd_spread,
kernel,
tau,
x_error,
y_error
) {
# Gaussian kernel weights centred on this ws/wd point
ws_cent <- mydata[[x]] - ws1
wd_cent <- mydata[[y]] - wd1
wd_cent <- ifelse(wd_cent < -180, wd_cent + 360, wd_cent)
weight <- gauss_dens(ws_cent, wd_cent, 0, 0, ws_spread, wd_spread)
weight <- weight / max(weight)
mydata$weight <- weight
# Select columns and filter low-weight rows
vars <- c(pol_1, pol_2, "weight")
if (!all(is.na(c(x_error, y_error)))) {
vars <- c(vars, x_error, y_error)
}
thedata <- mydata[vars]
thedata <- thedata[which(thedata$weight > 0.001), ]
# Insufficient data — return NA
if (nrow(thedata) < 100) {
return(NA)
}
# Dispatch to the appropriate statistical sub-function
if (statistic %in% c("r", "Pearson", "Spearman")) {
stat_weighted <- calc_corr_stat(thedata, pol_1, pol_2, statistic)
} else if (statistic %in% c("slope", "intercept")) {
stat_weighted <- calc_lm_stat(thedata, pol_1, pol_2, statistic)
} else if (statistic == "york_slope") {
stat_weighted <- calc_york_stat(thedata, pol_1, pol_2, x_error, y_error)
} else if (grepl("robust", statistic, ignore.case = TRUE)) {
stat_weighted <- calc_robust_stat(thedata, pol_1, pol_2, statistic)
} else if (grepl("quantile", statistic, ignore.case = TRUE)) {
stat_weighted <- calc_quantile_stat(thedata, pol_1, pol_2, statistic, tau)
} else {
stat_weighted <- NA
}
stat_weighted
}
# Weighted Pearson or Spearman correlation
calc_corr_stat <- function(thedata, pol_1, pol_2, statistic) {
if (statistic == "r") {
statistic <- "Pearson"
}
cont_corr(
thedata[[pol_1]],
thedata[[pol_2]],
w = thedata$weight,
method = statistic
)
}
# Weighted OLS slope or intercept
calc_lm_stat <- function(thedata, pol_1, pol_2, statistic) {
thedata <- data.frame(thedata)
fit <- stats::lm(
thedata[, pol_1] ~ thedata[, pol_2],
weights = thedata[, "weight"]
)
if (statistic == "slope") fit$coefficients[2] else fit$coefficients[1]
}
# York regression slope (errors-in-variables)
calc_york_stat <- function(thedata, pol_1, pol_2, x_error, y_error) {
thedata <- as.data.frame(thedata)
result <- try(
YorkFit(
thedata,
X = names(thedata)[2],
Y = names(thedata)[1],
Xstd = x_error,
Ystd = y_error,
weight = thedata$weight
),
TRUE
)
if (!inherits(result, "try-error")) result$Slope else NA
}
# Weighted robust (M-estimator) regression slope or intercept
calc_robust_stat <- function(thedata, pol_1, pol_2, statistic) {
thedata <- data.frame(thedata)
fit <- suppressWarnings(
try(
MASS::rlm(
thedata[, pol_1] ~ thedata[, pol_2],
weights = thedata[, "weight"],
method = "M"
),
TRUE
)
)
if (inherits(fit, "try-error")) {
return(NA)
}
if (statistic == "robust_slope") fit$coefficients[2] else fit$coefficients[1]
}
# Weighted quantile regression slope or intercept
calc_quantile_stat <- function(thedata, pol_1, pol_2, statistic, tau) {
rlang::check_installed("quantreg")
thedata <- data.frame(thedata)
fit <- suppressWarnings(
try(
quantreg::rq(
thedata[[pol_1]] ~ thedata[[pol_2]],
tau = tau,
weights = thedata[["weight"]],
method = "br"
),
TRUE
)
)
if (inherits(fit, "try-error")) {
return(NA)
}
if (statistic == "quantile_slope") {
fit$coefficients[2]
} else {
fit$coefficients[1]
}
}
# Weighted Pearson and Spearman correlations, based on wCorr package
cont_corr <- function(x, y, w, method = c("Pearson", "Spearman")) {
if (!is.numeric(x)) {
x <- as.numeric(x)
}
if (!is.numeric(y)) {
y <- as.numeric(y)
}
if (!is.numeric(w)) {
w <- as.numeric(w)
}
pm <- pmatch(tolower(method[[1]]), c("pearson", "spearman"))
if (is.na(pm)) {
stop("'method' must be 'Pearson' or 'Spearman'")
}
if (pm == 2) {
x <- wrank(x, w)
y <- wrank(y, w)
}
xb <- sum(w * x) / sum(w)
yb <- sum(w * y) / sum(w)
numerator <- sum(w * (x - xb) * (y - yb))
denom <- sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
numerator / denom
}
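As a sanity check on the formula in `cont_corr()`, the weighted Pearson correlation can be written out and compared against `stats::cov.wt()` (its variance normalisation factors cancel in the correlation). A standalone sketch:

```r
# The weighted Pearson formula used in cont_corr(), written out and checked
# against stats::cov.wt() (normalisation factors cancel in the correlation):
set.seed(42)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
w <- runif(50)
xb <- sum(w * x) / sum(w)
yb <- sum(w * y) / sum(w)
r_w <- sum(w * (x - xb) * (y - yb)) /
  sqrt(sum(w * (x - xb)^2) * sum(w * (y - yb)^2))
r_ref <- stats::cov.wt(cbind(x, y), wt = w / sum(w), cor = TRUE)$cor[1, 2]
stopifnot(isTRUE(all.equal(r_w, r_ref)))
```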
wrank <- function(x, w = rep(1, length(x))) {
# sort by x so we can just traverse once
ord <- order(x)
rord <- (seq_along(x))[order(ord)] # reverse order
xp <- x[ord] # x, permuted
wp <- w[ord] # weights, permuted
rnk <- rep(NA_real_, length(x)) # blank ranks vector
# set up first iteration
t1 <- 0 # total weight of lower ranked elements
i <- 1 # index
t2 <- 0 # total weight of tied elements (including self)
n <- 0 # number of tied elements
while (i < length(x)) {
t2 <- t2 + wp[i] # tied weight increases by this unit
n <- n + 1 # one additional tied unit
if (xp[i + 1] != xp[i]) {
# the next one is not a tie
# find the rank of all tied elements
rnki <- t1 + (n + 1) / (2 * n) * t2
# push that rank to all tied units
for (ii in 1:n) {
rnk[i - ii + 1] <- rnki
}
# reset for next iteration
t1 <- t1 + t2 # new total weight for lower values
t2 <- 0 # new tied weight starts at 0
n <- 0 # no tied units
}
i <- i + 1
}
# final row
n <- n + 1 # add final tie
t2 <- t2 + wp[i] # add final weight to tied weight
rnki <- t1 + (n + 1) / (2 * n) * t2 # final rank
# push that rank to all final tied units
for (ii in 1:n) {
rnk[i - ii + 1] <- rnki
}
# order by incoming index, so put in the original order
rnk[rord]
}
YorkFit <- function(
input_data,
X = "X",
Y = "Y",
Xstd = "Xstd",
Ystd = "Ystd",
weight = NA,
Ri = 0,
eps = 1e-7
) {
# weight can be supplied to apply to errors
tol <- eps # convergence tolerance for the iterative slope fit
# b0 initial guess at slope for OLR
form <- stats::formula(paste(Y, "~", X))
mod <- stats::lm(form, data = input_data)
b0 <- mod$coefficients[2]
X <- input_data[[X]]
Y <- input_data[[Y]]
Xstd <- input_data[[Xstd]]
Ystd <- input_data[[Ystd]]
# don't try regression if < 3 points; return the same NA tibble used for
# non-converged fits so callers see a consistent shape
if (sum(!is.na(X)) < 3 || sum(!is.na(Y)) < 3) {
return(dplyr::tibble(
Intercept = NA,
Slope = NA,
Int_error = NA,
Slope_error = NA,
OLS_slope = NA
))
}
# used in polar plots - Gaussian kernel weighting
if (!all(is.na(weight))) {
Xstd <- Xstd / weight
Ystd <- Ystd / weight
}
Xw <- 1 / (Xstd^2) # X weights
Yw <- 1 / (Ystd^2) # Y weights
# ITERATIVE CALCULATION OF SLOPE AND INTERCEPT #
b <- b0
b.diff <- tol + 1
n <- 0 # counter for debugging
while (b.diff > tol && n < 100) {
n <- n + 1 # counter to keep a check on convergence
b.old <- b
alpha.i <- sqrt(Xw * Yw)
Wi <- (Xw * Yw) / ((b^2) * Yw + Xw - 2 * b * Ri * alpha.i)
WiX <- Wi * X
WiY <- Wi * Y
sumWiX <- sum(WiX, na.rm = TRUE)
sumWiY <- sum(WiY, na.rm = TRUE)
sumWi <- sum(Wi, na.rm = TRUE)
Xbar <- sumWiX / sumWi
Ybar <- sumWiY / sumWi
Ui <- X - Xbar
Vi <- Y - Ybar
Bi <- Wi * ((Ui / Yw) + (b * Vi / Xw) - (b * Ui + Vi) * Ri / alpha.i)
wTOPint <- Bi * Wi * Vi
wBOTint <- Bi * Wi * Ui
sumTOP <- sum(wTOPint, na.rm = TRUE)
sumBOT <- sum(wBOTint, na.rm = TRUE)
b <- sumTOP / sumBOT
# zero or problematic data
if (anyNA(c(b, b.old))) {
return(dplyr::tibble(
Intercept = NA,
Slope = NA,
Int_error = NA,
Slope_error = NA,
OLS_slope = NA
))
}
b.diff <- abs(b - b.old)
}
a <- Ybar - b * Xbar
wYorkFitCoefs <- c(a, b)
# ERROR CALCULATION #
Xadj <- Xbar + Bi
WiXadj <- Wi * Xadj
sumWiXadj <- sum(WiXadj, na.rm = TRUE)
Xadjbar <- sumWiXadj / sumWi
Uadj <- Xadj - Xadjbar
wErrorTerm <- Wi * Uadj * Uadj
errorSum <- sum(wErrorTerm, na.rm = TRUE)
b.err <- sqrt(1 / errorSum)
a.err <- sqrt((1 / sumWi) + (Xadjbar^2) * (b.err^2))
wYorkFitErrors <- c(a.err, b.err)
# GOODNESS OF FIT CALCULATION #
lgth <- length(X)
wSint <- Wi * (Y - b * X - a)^2
sumSint <- sum(wSint, na.rm = TRUE)
# GOF (should equal 1 if assumptions are valid) and its standard error
wYorkGOF <- c(sumSint / (lgth - 2), sqrt(2 / (lgth - 2)))
ans <- dplyr::tibble(
Intercept = a,
Slope = b,
Int_error = a.err,
Slope_error = b.err,
OLS_slope = b0
)
return(ans)
}
# Bi-linear interpolation on a regular grid.
# Allows surface predictions at a low resolution to be interpolated rather than
# using a high-k GAM directly.
interp_surface <- function(obj, loc) {
# obj is a surface or image object like the list for contour, persp or image.
# loc a matrix of 2-d locations -- new points to evaluate the surface.
x <- obj$x
y <- obj$y
z <- obj$z
nx <- length(x)
ny <- length(y)
# this clever idea for finding the intermediate coordinates at the new points
# is from J-O Irisson
lx <- stats::approx(x, 1:nx, loc[, 1])$y
ly <- stats::approx(y, 1:ny, loc[, 2])$y
lx1 <- floor(lx)
ly1 <- floor(ly)
# x and y distances between each new point and the closest grid point in the lower left hand corner.
ex <- lx - lx1
ey <- ly - ly1
# fix up weights and indices for query points that coincide with the last
# grid point; without this, the lx1 + 1 / ly1 + 1 lookups below would index
# one row/column past the grid edge.
ex[lx1 == nx] <- 1
ey[ly1 == ny] <- 1
lx1[lx1 == nx] <- nx - 1
ly1[ly1 == ny] <- ny - 1
# bilinear interpolation finds simple weights based on the
# the four corners of the grid box containing the new points.
return(
z[cbind(lx1, ly1)] *
(1 - ex) *
(1 - ey) +
z[cbind(lx1 + 1, ly1)] * ex * (1 - ey) +
z[cbind(lx1, ly1 + 1)] * (1 - ex) * ey +
z[cbind(lx1 + 1, ly1 + 1)] * ex * ey
)
}
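The four-corner weighting used in `interp_surface()` can be spelt out for a single query point inside one unit grid cell. Bilinear interpolation is exact for a plane, which gives a simple standalone check (a sketch, not part of the package tests):

```r
# The four-corner bilinear weights from interp_surface(), for one query
# point inside a unit grid cell. Bilinear interpolation is exact for a
# plane (here z = 2x + 3y), which the assertion checks.
z <- matrix(c(0, 2, 3, 5), nrow = 2) # z[x, y] corners of the unit square
ex <- 0.25 # fractional x-offset within the cell
ey <- 0.5 # fractional y-offset within the cell
z_hat <- z[1, 1] * (1 - ex) * (1 - ey) +
  z[2, 1] * ex * (1 - ey) +
  z[1, 2] * (1 - ex) * ey +
  z[2, 2] * ex * ey
stopifnot(isTRUE(all.equal(z_hat, 2 * ex + 3 * ey)))
```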
# Bilinear interpolation from a coarse GAM prediction grid to a fine output grid
interp_grid <- function(input.data, x = "u", y = "v", z, n = 201) {
# current number of points
int <- length(unique(input.data[[x]]))
obj <- list(
x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = int),
y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = int),
z = matrix(z, nrow = int)
)
loc <- expand.grid(
x = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
y = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
)
res.interp <- interp_surface(obj, loc)
out <- expand.grid(
u = seq(min(input.data[[x]]), max(input.data[[x]]), length.out = n),
v = seq(min(input.data[[y]]), max(input.data[[y]]), length.out = n)
)
data.frame(out, z = res.interp)
}