#' K-means clustering of bivariate polar plots
#'
#' Function for identifying clusters in bivariate polar plots ([polarPlot()]);
#' identifying clusters in the original data for subsequent processing.
#'
#' Bivariate polar plots generated using the \code{polarPlot} function provide a
#' very useful graphical technique for identifying and characterising different
#' air pollution sources. While bivariate polar plots provide a useful graphical
#' indication of potential sources, their location and wind-speed or other
#' variable dependence, they do have several limitations. Often, a `feature'
#' will be detected in a plot but the subsequent analysis of data meeting
#' particular wind speed/direction criteria will be based only on the judgement
#' of the investigator concerning the wind speed-direction intervals of
#' interest. Furthermore, the identification of a feature can depend on the
#' choice of the colour scale used, making the process somewhat arbitrary.
#'
#' \code{polarCluster} applies Partition Around Medoids (PAM) clustering
#' techniques to [polarPlot()] surfaces to help identify potentially interesting
#' features for further analysis. Details of PAM can be found in the
#' \code{cluster} package (a core R package that will be pre-installed on all R
#' systems). PAM clustering is similar to k-means but has several advantages
#' e.g. is more robust to outliers. The clustering is based on the equal
#' contribution assumed from the u and v wind components and the associated
#' concentration. The data are standardized before clustering takes place.
#'
#' The function works best by first trying different numbers of clusters and
#' plotting them. This is achieved by setting \code{n.clusters} to be of length
#' more than 1. For example, if \code{n.clusters = 2:10} then a plot will be
#' output showing the 9 cluster levels 2 to 10.
#'
#' The clustering can also be applied to differences in polar plot surfaces (see
#' [polarDiff()]). On this case a second data frame (\code{after}) should be
#' supplied.
#'
#' Note that clustering is computationally intensive and the function can take a
#' long time to run --- particularly when the number of clusters is increased.
#' For this reason it can be a good idea to run a few clusters first to get a
#' feel for it e.g. \code{n.clusters = 2:5}.
#'
#' Once the number of clusters has been decided, the user can then run
#' \code{polarCluster} to return the original data frame together with a new
#' column \code{cluster}, which gives the cluster number as a character (see
#' example). Note that any rows where the value of \code{pollutant} is \code{NA}
#' are ignored so that the returned data frame may have fewer rows than the
#' original.
#'
#' Note that there are no automatic ways in ensuring the most appropriate number
#' of clusters as this is application dependent. However, there is often
#' a-priori information available on what different features in polar plots
#' correspond to. Nevertheless, the appropriateness of different clusters is
#' best determined by post-processing the data. The Carslaw and Beevers (2012)
#' paper discusses these issues in more detail.
#'
#' Note that unlike most other \code{openair} functions only a single
#' \code{type} \dQuote{default} is allowed.
#'
#' @inheritParams polarPlot
#' @param pollutant Mandatory. A pollutant name corresponding to a variable in a
#' data frame should be supplied e.g. \code{pollutant = "nox"}. Only one
#' pollutant can be chosen.
#' @param x Name of variable to plot against wind direction in polar
#' coordinates, the default is wind speed, \dQuote{ws}.
#' @param n.clusters Number of clusters to use. If \code{n.clusters} is more
#' than length 1, then a \code{lattice} panel plot will be output showing the
#' clusters identified for each one of \code{n.clusters}.
#' @param after The function can be applied to differences between polar plot
#' surfaces (see \link{polarDiff} for details). If an \code{after} data frame
#' is supplied, the clustering will be carried out on the differences between
#' \code{after} and \code{mydata} in the same way as \link{polarDiff}.
#' @param plot.data By default, the `data` component of `polarCluster()`
#' contains the original data frame appended with a new "cluster" column. When
#' `plot.data = TRUE`, the `data` component instead contains data to reproduce
#' the clustered polar plot itself (similar to `data` returned by
#' [polarPlot()]). This may be useful for re-plotting the `polarCluster()`
#' plot in other ways.
#' @inheritDotParams polarPlot -mydata -pollutant -x -wd -cols -angle.scale
#' -units -auto.text -plot
#' @export
#' @import cluster
#' @return an [openair][openair-package] object. The object includes four main
#' components: \code{call}, the command used to generate the plot;
#' \code{data}, by default the original data frame with a new field
#' \code{cluster} identifying the cluster, \code{clust_stats} giving the
#' contributions made by each cluster to number of measurements, their
#' percentage and the percentage by pollutant; and \code{plot}, the plot
#' itself. Note that any rows where the value of \code{pollutant} is \code{NA}
#' are ignored so that the returned data frame may have fewer rows than the
#' original.
#'
#' If the clustering is carried out considering differences, i.e., an
#' \code{after} data frame is supplied, the output also includes the
#' \code{after} data frame with cluster identified.
#' @author David Carslaw
#' @family polar directional analysis functions
#' @family cluster analysis functions
#' @references
#'
#' Carslaw, D.C., Beevers, S.D, Ropkins, K and M.C. Bell (2006). Detecting and
#' quantifying aircraft and other on-airport contributions to ambient nitrogen
#' oxides in the vicinity of a large international airport. Atmospheric
#' Environment. 40/28 pp 5424-5434.
#'
#' Carslaw, D.C., & Beevers, S.D. (2013). Characterising and understanding
#' emission sources using bivariate polar plots and k-means clustering.
#' Environmental Modelling & Software, 40, 325-329.
#' doi:10.1016/j.envsoft.2012.09.005
#' @examples
#' \dontrun{
#' ## plot 2-8 clusters. Warning! This can take several minutes...
#' polarCluster(mydata, pollutant = "nox", n.clusters = 2:8)
#'
#' # basic plot with 6 clusters
#' results <- polarCluster(mydata, pollutant = "nox", n.clusters = 6)
#'
#' ## get results, could read into a new data frame to make it easier to refer to
#' ## e.g. results <- results$data...
#' head(results$data)
#'
#' ## how many points are there in each cluster?
#' table(results$data$cluster)
#'
#' ## plot clusters 3 and 4 as a timeVariation plot using SAME colours as in
#' ## cluster plot
#' timeVariation(subset(results$data, cluster %in% c("3", "4")),
#' pollutant = "nox",
#' group = "cluster", col = openColours("Paired", 6)[c(3, 4)]
#' )
#' }
#'
polarCluster <-
function(mydata,
pollutant = "nox",
x = "ws",
wd = "wd",
n.clusters = 6,
after = NA,
cols = "Paired",
angle.scale = 315,
units = x,
auto.text = TRUE,
plot = TRUE,
plot.data = FALSE,
...) {
## avoid R check annoyances
u <- v <- z <- strip <- strip.left <- NULL
## greyscale handling
if (length(cols) == 1 && cols == "greyscale") {
trellis.par.set(list(strip.background = list(col = "white")))
}
## set graphics
current.strip <- trellis.par.get("strip.background")
current.font <- trellis.par.get("fontsize")
## reset graphic parameters
on.exit(trellis.par.set(fontsize = current.font))
# add id for later merging
mydata <- mutate(mydata, .id = 1:nrow(mydata))
if (is.data.frame(after)) {
after <- mutate(after, .id = 1:nrow(after))
data.orig.after <- after
}
data.orig <-
mydata ## keep original data so cluster can be merged with it
type <- "default"
vars <- c("wd", x, pollutant)
vars <- c(vars, "date", ".id")
mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
if (is.data.frame(after)) {
after <- checkPrep(after, vars, type, remove.calm = FALSE)
}
max.x <- ceiling(max(mydata[, x], na.rm = TRUE))
min.ws <- floor(min(mydata[[x]], na.rm = TRUE))
upper <- max.x
min.scale <- floor(min(mydata[[x]], na.rm = TRUE))
extra.args <- list(...)
## label controls
extra.args$xlab <- if ("xlab" %in% names(extra.args)) {
quickText(extra.args$xlab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$ylab <- if ("ylab" %in% names(extra.args)) {
quickText(extra.args$ylab, auto.text)
} else {
quickText("", auto.text)
}
extra.args$main <- if ("main" %in% names(extra.args)) {
quickText(extra.args$main, auto.text)
} else {
quickText("", auto.text)
}
if ("fontsize" %in% names(extra.args)) {
trellis.par.set(fontsize = list(text = extra.args$fontsize))
}
## layout default
if (!"layout" %in% names(extra.args)) {
extra.args$layout <- NULL
}
# if considering differences
if (is.data.frame(after)) {
results.grid <- polarDiff(
before = mydata,
after = after,
plot = FALSE,
pollutant = pollutant,
cluster = TRUE,
...
)$data
results.grid$z <- results.grid[[pollutant]]
} else {
results.grid <- polarPlot(
mydata,
plot = FALSE,
pollutant = pollutant,
x = x,
cluster = TRUE,
...
)$data
}
## remove missing because we don't want to find clusters for those points
## saves a lot on computation
results.grid <- na.omit(results.grid)
results.grid <- subset(results.grid, select = c(u, v, z))
## sequence of u or v, based on unique values that already exist
uv.id <- with(results.grid, sort(unique(c(u, v))))
make.clust <- function(i, results.grid) {
i <- n.clusters[i]
dat.orig <- results.grid
clusters <- pam(results.grid, i, stand = TRUE, pamonce = 3)
dat.orig$cluster <- clusters$clustering
dat.orig$nclust <- paste(i, "clusters")
dat.orig
}
# results.grid <-
# group_by(data.frame(n = seq_along(n.clusters)), n) %>%
# do(make.clust(i = .$n, results.grid))
results.grid <-
purrr::map(.x = seq_along(n.clusters),
.f = make.clust,
results.grid = results.grid, .progress = "Calculating Clusters") %>%
purrr::list_rbind()
results.grid$nclust <-
ordered(results.grid$nclust, levels = paste(n.clusters, "clusters"))
## auto scaling
nlev <- max(n.clusters) + 1
breaks <- c(0, 1:max(n.clusters))
nlev2 <- length(breaks)
col <- openColours(cols, (nlev2 - 1))
col.scale <- breaks
myform <- formula("cluster ~ u * v | nclust")
## find ids of u and v if only one cluster used
if (length(n.clusters) == 1L) {
## find indices in u-v space
results.grid$u.id <- findInterval(results.grid$u, uv.id)
results.grid$v.id <- findInterval(results.grid$v, uv.id)
mydata <- na.omit(mydata)
mydata <- transform(mydata,
u = get(x) * sin(wd * pi / 180),
v = get(x) * cos(wd * pi / 180)
)
mydata$u.id <- findInterval(mydata$u, uv.id, all.inside = TRUE)
mydata$v.id <- findInterval(mydata$v, uv.id, all.inside = TRUE)
## convert to matrix for direct lookup
## need to do this because some data are missing due to exclude.missing in polarPlot
mat.dim <-
max(results.grid[, c("u.id", "v.id")]) ## size of lookup matrix
temp <- matrix(NA, ncol = mat.dim, nrow = mat.dim)
## matrix of clusters by u.id, v.id with missings
temp[cbind(results.grid$u.id, results.grid$v.id)] <-
results.grid$cluster
## match u.id, v.id in mydata to cluster
mydata$cluster <-
as.factor(temp[cbind(mydata$u.id, mydata$v.id)])
mydata <-
select(mydata, date, cluster, .id) ## just need date/cluster
mydata <- left_join(data.orig, mydata, by = c(".id", "date"))
results <- mydata
myform <- formula("cluster ~ u * v")
# also find clusters in after data if there is any
if (is.data.frame((after))) {
after <- na.omit(after)
after <- transform(after,
u = get(x) * sin(wd * pi / 180),
v = get(x) * cos(wd * pi / 180)
)
after$u.id <- findInterval(after$u, uv.id, all.inside = TRUE)
after$v.id <- findInterval(after$v, uv.id, all.inside = TRUE)
## convert to matrix for direct lookup
## need to do this because some data are missing due to exclude.missing in polarPlot
mat.dim <-
max(results.grid[, c("u.id", "v.id")]) ## size of lookup matrix
temp <- matrix(NA, ncol = mat.dim, nrow = mat.dim)
## matrix of clusters by u.id, v.id with missings
temp[cbind(results.grid$u.id, results.grid$v.id)] <-
results.grid$cluster
## match u.id, v.id in after to cluster
after$cluster <-
as.factor(temp[cbind(after$u.id, after$v.id)])
after <-
select(after, date, cluster, .id) ## just need date/cluster
after <-
left_join(data.orig.after, after, by = c(".id", "date"))
}
}
## scaling of 'zeroed' data
## scale data by subtracting the min value
## this helps with dealing with data with offsets - e.g. negative data
mydata[[x]] <- mydata[[x]] - min(mydata[[x]], na.rm = TRUE)
intervals <- pretty(range(mydata[, x], na.rm = TRUE))
## labels for scaling
labels <- pretty(intervals + min.scale)
upper <- max(mydata[[x]], na.rm = TRUE)
## offset the lines/labels if necessary
intervals <- intervals + (min(labels) - min.scale)
## add zero in the middle if it exists
if (min.scale != 0) {
labels <- labels[-1]
intervals <- intervals[-1]
}
# interpolate grid, but first must make rectangular
# interval
int <-
as.numeric(tail(names(sort(table(
diff(results.grid$u)
))), 1))
# extent of grid required
extent <- max(abs(c(results.grid$u, results.grid$v)))
new_grid <- expand.grid(
u = seq(-extent, extent, by = int),
v = seq(-extent, extent, by = int)
)
levelplot.args <- list(
x = myform,
results.grid,
axes = FALSE,
as.table = TRUE,
col.regions = col,
region = TRUE,
aspect = 1,
at = col.scale,
par.strip.text = list(cex = 0.8),
scales = list(draw = FALSE),
xlim = c(-upper * 1.025, upper * 1.025),
ylim = c(-upper * 1.025, upper * 1.025),
colorkey = FALSE,
# legend = legend,
key = list(
rectangles = list(col = openColours(
cols,
max(n.clusters)
), border = NA),
text = list(lab = as.character(1:max(n.clusters))),
space = "right",
columns = 1,
title = "cluster",
cex.title = 1,
lines.title = 2
),
panel = function(x, y, z, subscripts, ...) {
panel.levelplot(
x,
y,
z,
subscripts,
at = col.scale,
pretty = TRUE,
col.regions = col,
labels = FALSE
)
angles <- seq(0, 2 * pi, length = 360)
sapply(intervals, function(x) {
llines(
x * sin(angles),
x * cos(angles),
col = "grey",
lty = 5
)
})
ltext(
1.07 * intervals * sin(pi * angle.scale / 180),
1.07 * intervals * cos(pi * angle.scale / 180),
sapply(paste(labels, c(
"", "", units, rep("", 7)
)), function(x) {
quickText(x, auto.text)
}),
cex = 0.7,
pos = 4
)
## add axis line to central polarPlot
larrows(-upper, 0, upper, 0, code = 3, length = 0.1)
larrows(0, -upper, 0, upper, code = 3, length = 0.1)
ltext(upper * -1 * 0.95, 0.07 * upper, "W", cex = 0.7)
ltext(0.07 * upper, upper * -1 * 0.95, "S", cex = 0.7)
ltext(0.07 * upper, upper * 0.95, "N", cex = 0.7)
ltext(upper * 0.95, 0.07 * upper, "E", cex = 0.7)
}
)
## reset for extra.args
levelplot.args <- listUpdate(levelplot.args, extra.args)
## plot
plt <- do.call(levelplot, levelplot.args)
## output ################################################################
if (plot) {
if (length(type) == 1L) {
plot(plt)
} else {
plot(useOuterStrips(plt, strip = strip, strip.left = strip.left))
}
}
out_data <- results
# currently conflicts with length(n.clusters) > 1 - resolve
if (is.function(out_data)) {
out_data <- NULL
}
# change cluster output to C1, C2 etc
if ("cluster" %in% names(out_data)) {
out_data$cluster <- paste("C", out_data$cluster, sep = "")
out_data$cluster[out_data$cluster == "CNA"] <- NA_character_
}
# print the stats but only for cluster of length 1
var_mean <- paste0("mean_", pollutant)
var_percent <- paste0(pollutant, "_percent")
if (length(n.clusters) == 1L && length(pollutant) == 1L) {
clust_stats <-
out_data %>%
dplyr::group_by(cluster) %>%
dplyr::summarise(
{{ var_mean }} := mean(.data[[pollutant]], na.rm = TRUE),
n = dplyr::n()
) %>%
na.omit() %>%
dplyr::mutate(
n_mean = n * .data[[var_mean ]],
n_percent = round(100 * n / sum(n), 1),
{{ var_percent }} := round(100 * n_mean / sum(n_mean), 1)
) %>%
dplyr::select(-n_mean)
} else {
clust_stats <- NULL
}
if (plot && length(n.clusters) == 1L) {
print(clust_stats)
}
if (plot.data) {
out_data <- results.grid
}
# output
if (is.data.frame(after)) {
output <-
list(
plot = plt,
data = out_data,
after = after,
call = match.call(),
clust_stats = clust_stats
)
} else {
output <- list(
plot = plt,
data = out_data,
call = match.call(),
clust_stats = clust_stats
)
}
class(output) <- "openair"
invisible(output)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.