# R/timeVariation.R

#' Diurnal, day of the week and monthly variation
#'
#' Plots the diurnal, day of the week and monthly variation for different
#' variables, typically pollutant concentrations. Four separate plots are
#' produced.
#'
#' The variation of pollutant concentrations by hour of the day and day of the
#' week etc. can reveal many interesting features that relate to source types
#' and meteorology. For traffic sources, there are often important differences
#' in the way traffic flows vary by vehicle type, e.g. fewer heavy vehicles at
#' weekends.
#'
#' The \code{timeVariation} function makes it easy to see how concentrations
#' (and many other variable types) vary by hour of the day and day of the week.
#'
#' The plots also show the 95\% confidence intervals in the mean. The 95\%
#' confidence intervals in the mean are calculated through bootstrap
#' simulations, which will provide more robust estimates of the confidence
#' intervals (particularly when there are relatively few data).
#'
#' The function can handle multiple pollutants and uses the flexible \code{type}
#' option to provide separate panels for each 'type' --- see \code{cutData} for
#' more details. \code{timeVariation} can also accept a \code{group} option
#' which is useful if data are stacked. This will work in a similar way to
#' having multiple pollutants in separate columns.
#'
#' The user can supply their own \code{ylim} e.g. \code{ylim = c(0, 200)} that
#' will be used for all plots. \code{ylim} can also be a list of length four to
#' control the y-limits on each individual plot e.g. \code{ylim =
#' list(c(-100,500), c(200, 300), c(-400,400), c(50,70))}. These pairs
#' correspond to the hour, weekday, month and day-hour plots respectively.
#'
#' The option \code{difference} will calculate the difference in means of two
#' pollutants together with bootstrap estimates of the 95\% confidence intervals
#' in the difference in the mean. This works in two ways: either two pollutants
#' are supplied in separate columns e.g. \code{pollutant = c("no2", "o3")}, or
#' there are two unique values of \code{group}. The difference is calculated as
#' the second pollutant minus the first and is labelled as such. Considering
#' differences in this way can provide many useful insights and is particularly
#' useful for model evaluation when information is needed about where a model
#' differs from observations across many different time scales. The manual contains
#' various examples of using \code{difference = TRUE}.
#'
#' Note also that the \code{timeVariation} function works well on a subset of
#' data and in conjunction with other plots. For example, a
#' \code{\link{polarPlot}} may highlight an interesting feature for a particular
#' wind speed/direction range. By filtering for those conditions
#' \code{timeVariation} can help determine whether the temporal variation of
#' that feature differs from other features --- and help with source
#' identification.
#'
#' In addition, \code{timeVariation} will work well with other variables if
#' available. Examples include meteorological and traffic flow data.
#'
#' Depending on the choice of statistic, a subheading is added. Users can
#' control the text in the subheading through the use of \code{sub} e.g.
#' \code{sub = ""} will remove any subheading.
#'
#' @param mydata A data frame of hourly (or higher temporal resolution data).
#'   Must include a \code{date} field and at least one variable to plot.
#' @param pollutant Name of variable to plot. Two or more pollutants can be
#'   plotted, in which case a form like \code{pollutant = c("nox", "co")} should
#'   be used.
#' @param local.tz Should the results be calculated in local time that includes
#'   a treatment of daylight savings time (DST)? The default is not to consider
#'   DST issues, provided the data were imported without a DST offset. Emissions
#'   activity tends to occur at local time e.g. rush hour is at 8 am every day.
#'   When the clocks go forward in spring, the emissions are effectively
#'   released into the atmosphere typically 1 hour earlier during the summertime
#'   i.e. when DST applies. When plotting diurnal profiles, this has the effect
#'   of \dQuote{smearing-out} the concentrations. Sometimes, a useful approach
#'   is to express time as local time. This correction tends to produce
#'   better-defined diurnal profiles of concentration (or other variables) and
#'   allows a better comparison to be made with emissions/activity data. If left
#'   as the default (\code{NULL}) then GMT is used. Examples of usage include \code{local.tz
#'   = "Europe/London"}, \code{local.tz = "America/New_York"}. See
#'   \code{cutData} and \code{import} for more details.
#' @param normalise Should variables be normalised? The default is \code{FALSE}.
#'   If \code{TRUE} then the variable(s) are divided by their mean values. This
#'   helps to compare the shape of the diurnal trends for variables on very
#'   different scales.
#' @param xlab x-axis label; one for each sub-plot.
#' @param name.pol Names to be given to the pollutant(s). This is useful if you
#'   want to give a fuller description of the variables, maybe also including
#'   subscripts etc.
#' @param type \code{type} determines how the data are split i.e. conditioned,
#'   and then plotted. The default, \dQuote{default}, will produce a single plot
#'   using the entire data. Type can be one of the built-in types as detailed in
#'   \code{cutData} e.g. \dQuote{season}, \dQuote{year}, \dQuote{weekday} and so
#'   on. For example, \code{type = "season"} will produce four plots --- one for
#'   each season.
#'
#'   It is also possible to choose \code{type} as another variable in the data
#'   frame. If that variable is numeric, then the data will be split into four
#'   quantiles (if possible) and labelled accordingly. If type is an existing
#'   character or factor variable, then those categories/levels will be used
#'   directly. This offers great flexibility for understanding the variation of
#'   different variables and how they depend on one another.
#'
#'   Only one \code{type} is allowed in \code{timeVariation}.
#' @param group This sets the grouping variable to be used. For example, if a
#'   data frame had a column \code{site}, setting \code{group = "site"} will plot
#'   all sites together in each panel. See examples below.
#' @param difference If two pollutants are chosen then setting \code{difference
#'   = TRUE} will also plot the difference in means between the two variables as
#'   \code{pollutant[2] - pollutant[1]}. Bootstrap 95\% confidence intervals of
#'   the difference in means are also calculated. A horizontal dashed line is
#'   shown at y = 0. The difference can also be calculated if there is a column
#'   that identifies two groups e.g. having used \code{splitByDate}. In this
#'   case it is possible to call \code{timeVariation} with the option
#'   \code{group = "split.by"} and \code{difference = TRUE}.
#' @param statistic Can be \dQuote{mean} (default) or \dQuote{median}. If the
#'   statistic is \sQuote{mean} then the mean line and the 95\% confidence
#'   interval in the mean are plotted by default. If the statistic is
#'   \sQuote{median} then the median line is plotted together with the 5/95th
#'   and 25/75th quantiles. Users can control the confidence intervals
#'   with \code{conf.int}.
#' @param conf.int The confidence intervals to be plotted. If \code{statistic =
#'   "mean"} then the confidence intervals in the mean are plotted. If
#'   \code{statistic = "median"} then the \code{conf.int} and \code{1 -
#'   conf.int} \emph{quantiles} are plotted. \code{conf.int} can be of length 2,
#'   which is most useful for showing quantiles. For example \code{conf.int =
#'   c(0.75, 0.99)} will yield a plot showing the median, 25/75 and 5/95th
#'   quantiles.
#' @param B Number of bootstrap replicates to use. It can be useful to reduce
#'   this value when there are many observations, which speeds up the
#'   calculations without greatly affecting the 95\% confidence interval
#'   estimates.
#' @param ci Should confidence intervals be shown? The default is \code{TRUE}.
#'   Setting this to \code{FALSE} can be useful if multiple pollutants are
#'   chosen, where overlapping confidence intervals can over-complicate plots.
#' @param cols Colours to be used for plotting. Options include
#'   \dQuote{default}, \dQuote{increment}, \dQuote{heat}, \dQuote{jet} and
#'   \code{RColorBrewer} colours --- see the \code{openair} \code{openColours}
#'   function for more details. For user-defined colour schemes the user can
#'   supply a list of colour names recognised by R (type \code{colours()} to see
#'   the full list). An example would be \code{cols = c("yellow", "green", "blue")}.
#' @param ref.y A list with details of the horizontal lines to be added
#'   representing reference line(s). For example, \code{ref.y = list(h = 50, lty
#'   = 5)} will add a dashed horizontal line at 50. Several lines can be plotted
#'   e.g. \code{ref.y = list(h = c(50, 100), lty = c(1, 5), col = c("green",
#'   "blue"))}. See \code{panel.abline} in the \code{lattice} package for more
#'   details on adding/controlling lines.
#' @param key By default \code{timeVariation} produces four plots on one page.
#'   While it is useful to see these plots together, it is sometimes necessary
#'   just to use one for a report. If \code{key} is \code{TRUE}, a key is added
#'   to all plots allowing the extraction of a single plot \emph{with} key. See
#'   below for an example.
#' @param key.columns Number of columns to be used in the key. By default one
#'   column is used for each pollutant (or group), which with many pollutants
#'   can make the key too wide. The user can thus choose fewer columns by
#'   setting \code{key.columns} to be less than the number of pollutants.
#' @param start.day What day of the week should the plots start on? The user can
#'   change the start day by supplying an integer between 0 and 6. Sunday = 0,
#'   Monday = 1, \ldots For example to start the weekday plots on a Saturday,
#'   choose \code{start.day = 6}.
#' @param panel.gap The gap between panels in the hour-day plot.
#' @param auto.text Either \code{TRUE} (default) or \code{FALSE}. If \code{TRUE}
#'   titles and axis labels will automatically try and format pollutant names
#'   and units properly, e.g. by subscripting the \sQuote{2} in NO2.
#' @param alpha The alpha transparency used for plotting confidence intervals. 0
#'   is fully transparent and 1 is opaque. The default is 0.4.
#' @param month.last Should the order of the plots be changed so that the plot
#'   showing monthly means is the last plot, giving a logical hierarchy of
#'   averaging periods?
#' @param plot Should a plot be produced? \code{FALSE} can be useful when
#'   analysing data to extract plot components and plotting them in other ways.
#' @param ... Other graphical parameters passed on to \code{lattice::xyplot} and
#'   \code{cutData}. For example, in the case of \code{cutData} the option
#'   \code{hemisphere = "southern"}.
#'
#' @import lattice
#' @export
#' @return an [openair][openair-package] object. The four components of
#'   timeVariation are: \code{day.hour}, \code{hour}, \code{day} and
#'   \code{month}. Associated data.frames can be extracted directly using the
#'   \code{subset} option, e.g. as in \code{plot(output, subset = "day.hour")},
#'   \code{summary(output, subset = "hour")}, etc., for \code{output <-
#'   timeVariation(mydata, "nox")}
#' @author David Carslaw
#' @family time series and trend functions
#' @examples
#'
#'
#' # basic use
#' timeVariation(mydata, pollutant = "nox")
#'
#' # for a subset of conditions
#' \dontrun{
#' timeVariation(subset(mydata, ws > 3 & wd > 100 & wd < 270),
#'   pollutant = "pm10", ylab = "pm10 (ug/m3)")
#' }
#'
#' # multiple pollutants with concentrations normalised
#' \dontrun{timeVariation(mydata, pollutant = c("nox", "co"), normalise = TRUE)}
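#'
#' # the y-limits can also be given as a list of four pairs corresponding to
#' # the hour, weekday, month and day-hour plots; the values below are
#' # illustrative only
#' \dontrun{
#' timeVariation(mydata, pollutant = "nox",
#'   ylim = list(c(0, 300), c(0, 300), c(0, 300), c(0, 300)))
#' }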
#'
#' # show BST/GMT variation (see ?cutData for more details)
#' # the NOx plot shows the profiles are very similar when expressed in
#' # local time, showing that the profile is dominated by a local source
#' # that varies by local time and not by GMT i.e. road vehicle emissions
#'
#' \dontrun{timeVariation(mydata, pollutant = "nox", type = "dst", local.tz = "Europe/London")}
#'
#' ## In this case it is better to group the results for clarity:
#' \dontrun{timeVariation(mydata, pollutant = "nox", group = "dst", local.tz = "Europe/London")}
#'
#' # By contrast, a variable such as wind speed shows a clear shift when
#' #  expressed in local time. These two plots can help show whether the
#' #  variation is dominated by man-made influences or natural processes
#'
#' \dontrun{timeVariation(mydata, pollutant = "ws", group = "dst", local.tz = "Europe/London")}
#'
#' ## It is also possible to plot several variables and set type. For
#' ## example, consider the NOx and NO2 split by levels of O3:
#'
#' \dontrun{timeVariation(mydata, pollutant = c("nox", "no2"), type = "o3", normalise = TRUE)}
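#'
#' ## name.pol can supply more descriptive names for the plot key; the
#' ## labels below are illustrative and are formatted by quickText
#' \dontrun{
#' timeVariation(mydata, pollutant = c("nox", "no2"),
#'   name.pol = c("oxides of nitrogen", "nitrogen dioxide"))
#' }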
#'
#' ## difference in concentrations
#' \dontrun{timeVariation(mydata, pollutant = c("pm25", "pm10"), difference = TRUE)}
#'
#' # It is also useful to consider how concentrations vary by
#' # considering two different periods e.g. in intervention
#' # analysis. In the following plot NO2 has clearly increased but much
#' # less so at weekends - perhaps suggesting vehicles other than cars
#' # are important because flows of cars are approximately invariant by
#' # day of the week
#'
#' \dontrun{
#' mydata <- splitByDate(mydata, dates = "1/1/2003",
#'   labels = c("before Jan. 2003", "after Jan. 2003"))
#' timeVariation(mydata, pollutant = "no2", group = "split.by", difference = TRUE)
#' }
#'
#' ## sub plots can be extracted from the openair object
#' \dontrun{
#' myplot <- timeVariation(mydata, pollutant = "no2")
#' plot(myplot, subset = "day.hour") # top weekday and hour plot
#' }
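#'
#' ## key = TRUE adds the key to each sub-plot so that a single extracted
#' ## plot retains its key
#' \dontrun{
#' myplot <- timeVariation(mydata, pollutant = c("nox", "no2"), key = TRUE)
#' plot(myplot, subset = "hour")
#' }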
#'
#' ## individual plots
#' ## plot(myplot, subset="day.hour") for the weekday and hours subplot (top)
#' ## plot(myplot, subset="hour") for the diurnal plot
#' ## plot(myplot, subset="day") for the weekday plot
#' ## plot(myplot, subset="month") for the monthly plot
#'
#' ## numerical results (mean, lower/upper uncertainties)
#' ## myplot$data$day.hour # the weekday and hour data set
#' ## summary(myplot, subset = "hour") # summary of hour data set
#' ## head(myplot, subset = "day") # head/top of day data set
#' ## tail(myplot, subset = "month") # tail/bottom of month data set
#'
#' ## plot quantiles and median
#' \dontrun{
#' timeVariation(mydata, statistic = "median", pollutant = "pm10", cols = "firebrick")
#'
#' ## with different intervals
#' timeVariation(mydata, statistic = "median", pollutant = "pm10",
#'   conf.int = c(0.75, 0.99), cols = "firebrick")
#' }
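#'
#' ## add dashed horizontal reference line(s) and start the weekday plot on
#' ## a Saturday; the reference value of 40 is illustrative only
#' \dontrun{
#' timeVariation(mydata, pollutant = "no2",
#'   ref.y = list(h = 40, lty = 5), start.day = 6)
#' }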
#'
timeVariation <- function(mydata, pollutant = "nox", local.tz = NULL,
                          normalise = FALSE, xlab = c(
                            "hour", "hour", "month",
                            "weekday"
                          ), name.pol = pollutant,
                          type = "default", group = NULL, difference = FALSE,
                          statistic = "mean", conf.int = 0.95, B = 100, ci = TRUE, cols = "hue",
                          ref.y = NULL, key = NULL, key.columns = 1, start.day = 1,
                          panel.gap = 0.2,
                          auto.text = TRUE, alpha = 0.4, month.last = FALSE, plot = TRUE,
                          ...) {

  ## get rid of R check annoyances
  variable <- NULL

  ## greyscale handling
  if (length(cols) == 1 && cols == "greyscale") {
    trellis.par.set(list(strip.background = list(col = "white")))
  }

  ## set graphics
  current.strip <- trellis.par.get("strip.background")
  current.font <- trellis.par.get("fontsize")

  ## reset graphic parameters
  on.exit(trellis.par.set(

    fontsize = current.font
  ))

  ## extra.args setup
  extra.args <- list(...)

  ## label controls
  ## xlab handled in formals and code because unique
  extra.args$ylab <- if ("ylab" %in% names(extra.args)) {
    quickText(extra.args$ylab, auto.text)
  } else {
    quickText(paste(pollutant, collapse = ", "), auto.text)
  }

  extra.args$main <- if ("main" %in% names(extra.args)) {
    quickText(extra.args$main, auto.text)
  } else {
    quickText("", auto.text)
  }

  if ("fontsize" %in% names(extra.args)) {
    trellis.par.set(fontsize = list(text = extra.args$fontsize))
  }

  if (statistic == "median" && missing(conf.int)) conf.int <- c(0.75, 0.95)

  ## sub heading stat info
  if (statistic == "mean") {
    sub.text <- paste(
      "mean and ", 100 * conf.int[1], "% confidence interval in mean",
      sep = ""
    )
  } else {
    if (length(conf.int) == 1L) {
      sub.text <- paste(
        "median and ", 100 * (1 - conf.int[1]), "/", 100 * conf.int[1],
        "th quantiles",
        sep = ""
      )
    } else {
      sub.text <- paste(
        "median, ", 100 * (1 - conf.int[1]), "/", 100 * conf.int[1],
        " and ", 100 * (1 - conf.int[2]), "/", 100 * conf.int[2],
        "th quantiles",
        sep = ""
      )
    }
  }


  extra.args$sub <- if ("sub" %in% names(extra.args)) {
    quickText(extra.args$sub, auto.text)
  } else {
    quickText(sub.text, auto.text)
  }

  extra.args$lwd <- if ("lwd" %in% names(extra.args)) extra.args$lwd else 2

  ylim.handler <- if ("ylim" %in% names(extra.args)) {
    FALSE
  } else {
    TRUE
  }


  ## if user supplies separate ylims for each plot
  ylimList <- FALSE

  if ("ylim" %in% names(extra.args)) {
    if (is.list(extra.args$ylim)) {
      if (length(extra.args$ylim) != 4) stop("ylim should be a list of 4")
      ylim.list <- extra.args$ylim
      ylimList <- TRUE
    }
  }

  vars <- c("date", pollutant)

  ##  various checks to make sure all the data are available ####################

  if (!missing(group) & length(pollutant) > 1) {
    stop("Can only have one pollutant with a grouping variable, or several pollutants and no grouping variable.")
  }

  ## only one type for now
  if (length(type) > 1) stop("Can only have one type for timeVariation.")

  if (type %in% pollutant) stop("Cannot have type the same as a pollutant name.")

  if (!missing(group)) {
    if (group %in% pollutant) stop("Cannot have group the same as a pollutant name.")
  }

  ## if differences between two pollutants are calculated
  if (difference) {
    if (missing(group)) {
      if (length(pollutant) != 2) {
        stop("Need to specify two pollutants to calculate their difference.")
      }
    }

    if (!missing(group)) {
      if (length(unique(na.omit(mydata[[group]]))) != 2) {
        stop("Need to specify two pollutants to calculate their difference.")
      }
    }
  }

  ## statistic check
  if (!statistic %in% c("mean", "median")) {
    statistic <- "mean"
    warning("statistic should be 'mean' or 'median',  setting it to 'mean'.")
  }
  ## #################################################################################

  ## check to see if type = "variable" (word used in code, so change)
  if (type == "variable") {
    mydata <- rename(mydata, c(variable = "tempVar"))
    type <- "tempVar"
  }

  ## if group is present, need to add that list of variables
  if (!missing(group)) {
    if (group %in% dateTypes) {
      vars <- unique(c(vars, "date"))
    } else {
      vars <- unique(c(vars, group))
    }
  }



  ## data checks
  mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
  if (!missing(group)) mydata <- cutData(mydata, group, local.tz = local.tz, ...)
  mydata <- cutData(mydata, type, local.tz = local.tz, ...)

  ## put in local time if needed
  if (!is.null(local.tz)) attr(mydata$date, "tzone") <- local.tz

  ## title for overall and individual plots
  overall.main <- extra.args$main
  extra.args$main <- ""
  overall.sub <- extra.args$sub
  extra.args$sub <- ""

  ## labels for pollutants, can be user-defined, special handling when difference = TRUE
  poll.orig <- pollutant
  if (difference && missing(group)) pollutant <- c(pollutant, paste(pollutant[2], "-", pollutant[1]))

  if (missing(name.pol)) {
    mylab <- sapply(seq_along(pollutant), function(x)
      quickText(pollutant[x], auto.text))
  }

  if (!missing(name.pol)) {
    mylab <- sapply(seq_along(name.pol), function(x)
      quickText(name.pol[x], auto.text))
  }

  if (missing(group)) {

    mydata <- gather(mydata, key = variable, value = value, poll.orig)
    mydata$variable <- factor(mydata$variable, levels = pollutant) ## drop unused factor levels
  } else {
    ## group needs to be 'variable' and pollutant 'value'
    id <- which(names(mydata) == poll.orig)
    names(mydata)[id] <- "value"
    id <- which(names(mydata) == group)
    names(mydata)[id] <- "variable"

    mydata$variable <- factor(mydata$variable) ## drop unused factor levels
    the.names <- levels(mydata[["variable"]])
    if (difference) {
      the.names <- c(the.names, paste(
        levels(mydata$variable)[2], "-",
        levels(mydata$variable)[1]
      ))
    }
    mylab <- sapply(the.names, function(x) quickText(x, auto.text))
  }


  ## function to normalise
  divide.by.mean <- function(x) {
    Mean <- mean(x$Mean, na.rm = TRUE)
    x$Mean <- x$Mean / Mean
    x$Lower <- x$Lower / Mean
    x$Upper <- x$Upper / Mean
    x
  }

  if (normalise) extra.args$ylab <- "normalised level"

  ## get days in right order
  days <- format(ISOdate(2000, 1, 2:8), "%A")
  days.abb <- format(ISOdate(2000, 1, 2:8), "%a")
  if (start.day < 0 || start.day > 6) stop("start.day must be between 0 and 6.")

  if (start.day > 0) {
    day.ord <- c(days[(1 + start.day):7], days[1:(1 + start.day - 1)])
    day.ord.abb <- c(days.abb[(1 + start.day):7], days.abb[1:(1 + start.day - 1)])
  } else {
    day.ord <- days
    day.ord.abb <- days.abb
  }

  ## calculate temporal components
  mydata <- mutate(mydata,
                   wkday = wday(date, label = TRUE, abbr = FALSE),
                   wkday = ordered(wkday, levels = day.ord),
                   hour = hour(date),
                   mnth = month(date)
  )


  ## y range taking account of expanded uncertainties
  rng <- function(x) {

    ## if no CI information, just return
    if (all(is.na(x[, c("Lower", "Upper")]))) {
      lims <- NULL
      return(lims)
    }

    if (ci) {
      lims <- range(c(x$Lower, x$Upper), na.rm = TRUE)
      inc <- 0.04 * abs(lims[2] - lims[1])
      lims <- c(lims[1] - inc, lims[2] + inc)
    } else {
      lims <- range(x$Mean, na.rm = TRUE)
      inc <- 0.04 * abs(lims[2] - lims[1])
      lims <- c(lims[1] - inc, lims[2] + inc)
      if (diff(lims) == 0) lims <- NULL
    }
    lims
  }

  npol <- length(levels(mydata$variable)) ## number of pollutants

  if (difference) {
    npol <- 3 ## 3 pollutants if difference considered
    if (missing(group)) {
      poll1 <- pollutant[1]
      poll2 <- pollutant[2]
    } else {
      poll1 <- levels(mydata$variable)[1]
      poll2 <- levels(mydata$variable)[2]
    }
  }

  ## number of columns for key
  if (missing(key.columns)) key.columns <- npol

  myColors <- openColours(cols, npol)

  ## for individual plot keys - useful if only one of the plots is extracted after printing
  if (!is.null(key)) {
    key <- list(
      rectangles = list(col = myColors[1:npol], border = NA), title = "",
      text = list(lab = mylab), space = "bottom", columns = key.columns,
      lines.title = 1
    )

    extra.args$main <- overall.main
  }

  # data frame of confidence intervals
  conf_int <- data.frame(ci = conf.int)


  ## hour ############################################################################

  if (difference) {
    data.hour <- errorDiff(
      mydata,
      vars = "hour", type = type, poll1 = poll1,
      poll2 = poll2, B = B, conf.int = conf.int
    )
  } else {
    data.hour <- group_by(conf_int, ci) %>%
      do(proc(
        .$ci, mydata,
        vars = "hour", pollutant, type, B = B,
        statistic = statistic
      ))
  }


  if (normalise) {
    data.hour <- group_by(data.hour, variable) %>%
      do(divide.by.mean(.))
  }

  if (is.null(xlab[2]) | is.na(xlab[2])) xlab[2] <- "hour"

  ## proper names of labelling ##########################################################
  if (type != "default") {
    stripName <- sapply(levels(factor(data.hour[[type]])), function(x) quickText(x, auto.text))
    strip <- strip.custom(factor.levels = stripName)
  } else {
    strip <- FALSE
  }
  ## ###################################################################################

  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ hour | ", temp, sep = ""))

  ## ylim handler
  if (ylim.handler) {
    extra.args$ylim <- rng(data.hour)
  }

  ## user supplied separate ylim
  if (ylimList) {
    extra.args$ylim <- ylim.list[[1]]
  }

  ## plot
  xy.args <- list(
    x = myform, data = data.hour, groups = data.hour$variable,
    as.table = TRUE,
    xlab = xlab[2],
    xlim = c(0, 23),
    strip = strip,
    par.strip.text = list(cex = 0.8),
    key = key,
    scales = list(x = list(at = c(0, 6, 12, 18, 23))),
    par.settings = simpleTheme(col = myColors),
    panel = function(x, y, ...) {
      panel.grid(-1, 0)
      panel.abline(v = c(0, 6, 12, 18, 23), col = "grey85")
      panel.superpose(
        x, y, ...,
        panel.groups = function(x, y, col.line, type,
                                group.number, subscripts, ...) {
          if (difference) panel.abline(h = 0, lty = 5)
          ## plot once
          id <- which(data.hour$ci[subscripts] == data.hour$ci[1])
          panel.xyplot(
            x[id], y[id],
            type = "l",
            col.line = myColors[group.number], ...
          )

          if (ci) {
            mkpoly(
              data.hour[subscripts, ],
              x = "hour", y = "Mean",
              group.number, myColors, alpha
            )
          }
          ## reference line(s)
          if (!is.null(ref.y)) do.call(panel.abline, ref.y)
        }
      )
    }
  )

  ## reset for extra.args
  xy.args <- listUpdate(xy.args, extra.args)

  ## plot
  hour <- do.call(xyplot, xy.args)

  ## weekday ############################################################################

  if (difference) {
    data.weekday <- errorDiff(
      mydata,
      vars = "wkday", type = type, poll1 = poll1,
      poll2 = poll2, B = B, conf.int = conf.int
    )
  } else {
    data.weekday <- group_by(conf_int, ci) %>%
      do(proc(
        .$ci, mydata,
        vars = "wkday", pollutant, type, B = B,
        statistic = statistic
      ))
  }

  if (normalise) {
    data.weekday <- group_by(data.weekday, variable) %>%
      do(divide.by.mean(.))
  }

  data.weekday$wkday <- ordered(data.weekday$wkday, levels = day.ord)

  data.weekday$wkday <- as.numeric(as.factor(data.weekday$wkday))

  if (is.null(xlab[4]) | is.na(xlab[4])) xlab[4] <- "weekday"

  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ wkday | ", temp, sep = ""))

  ## ylim handler
  if (ylim.handler) {
    extra.args$ylim <- rng(data.weekday)
  }

  ## user supplied separate ylim
  if (ylimList) {
    extra.args$ylim <- ylim.list[[2]]
  }

  ## plot
  xy.args <- list(
    x = myform, data = data.weekday, groups = data.weekday$variable,
    as.table = TRUE,
    par.settings = simpleTheme(col = myColors, pch = 16),
    scales = list(x = list(at = 1:7, labels = day.ord.abb)),
    xlab = xlab[4],
    strip = strip,
    par.strip.text = list(cex = 0.8),
    key = key,
    panel = function(x, y, ...) {
      panel.grid(-1, 0)
      panel.abline(v = 1:7, col = "grey85")
      panel.superpose(
        x, y, ...,
        panel.groups = function(x, y, col.line, type, group.number,
                                subscripts, ...) {
          if (difference) panel.abline(h = 0, lty = 5)

          ## only plot median once if 2 conf.int
          id <- which(data.weekday$ci[subscripts] == data.weekday$ci[1])
          panel.xyplot(
            x[id], y[id],
            type = "l",
            col.line = myColors[group.number], ...
          )

          if (ci) {
            mkrect(
              data.weekday[subscripts, ],
              x = "wkday",
              y = "Mean", group.number, myColors, alpha
            )
          }
          ## reference line(s)
          if (!is.null(ref.y)) do.call(panel.abline, ref.y)
        }
      )
    }
  )
  ## reset for extra.args
  xy.args <- listUpdate(xy.args, extra.args)

  ## plot
  day <- do.call(xyplot, xy.args)

  ## month ############################################################################

  if (difference) {
    data.month <- errorDiff(
      mydata,
      vars = "mnth", type = type, poll1 = poll1,
      poll2 = poll2, B = B, conf.int = conf.int
    )
  } else {
    data.month <- group_by(conf_int, ci) %>%
      do(proc(
        .$ci, mydata,
        vars = "mnth", pollutant, type, B = B,
        statistic = statistic
      ))
  }

  if (normalise) {
    data.month <- group_by(data.month, variable) %>%
      do(divide.by.mean(.))
  }


  if (is.null(xlab[3]) | is.na(xlab[3])) xlab[3] <- "month"

  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ mnth | ", temp, sep = ""))

  ## ylim handler
  if (ylim.handler) {
    extra.args$ylim <- rng(data.month)
  }

  ## user supplied separate ylim
  if (ylimList) {
    extra.args$ylim <- ylim.list[[3]]
  }

  ## plot
  xy.args <-
    list(
      x = myform, data = data.month, groups = data.month$variable,
      as.table = TRUE,
      xlab = xlab[3],
      xlim = c(0.5, 12.5),
      key = key,
      strip = strip,
      par.strip.text = list(cex = 0.8),
      par.settings = simpleTheme(col = myColors, pch = 16),
      scales = list(x = list(
        at = 1:12,
        labels = substr(format(
          seq(
            as_date("2000-01-01"),
            as_date("2000-12-31"), "month"
          ),
          "%B"
        ), 1, 1)
      )),
      panel = function(x, y, ...) {
        panel.grid(-1, 0)
        panel.abline(v = 1:12, col = "grey85")
        panel.superpose(
          x, y, ...,
          panel.groups = function(x, y, col.line, type,
                                  group.number, subscripts, ...) {
            if (difference) panel.abline(h = 0, lty = 5)

            ## a line won't work for a single point
            pltType <- "l"
            if (length(subscripts) == 1) pltType <- "p"

            ## only plot median once if 2 conf.int
            id <- which(data.month$ci[subscripts] == data.month$ci[1])

            ## lines not needed if split by season
            if (is.null(group) || group != "season") {
              panel.xyplot(
                x[id], y[id],
                type = pltType,
                col.line = myColors[group.number], ...
              )
            }

            if (ci) {
              mkrect(
                data.month[subscripts, ],
                x = "mnth",
                y = "Mean", group.number, myColors, alpha
              )
            }
            ## reference line(s)
            if (!is.null(ref.y)) do.call(panel.abline, ref.y)
          }
        )
      }
    )

  ## reset for extra.args
  xy.args <- listUpdate(xy.args, extra.args)

  ## plot
  month <- do.call(xyplot, xy.args)

  ## day and hour ############################################################################

  if (difference) {
    data.day.hour <- errorDiff(
      mydata,
      vars = "day.hour", type = type, poll1 = poll1,
      poll2 = poll2, B = B, conf.int = conf.int
    )
  } else {
    data.day.hour <- group_by(conf_int, ci) %>%
      do(proc(
        .$ci, mydata,
        vars = "day.hour", pollutant, type, B = B,
        statistic = statistic
      ))
  }

  if (normalise) {
    data.day.hour <- group_by(data.day.hour, variable) %>%
      do(divide.by.mean(.))
  }

  ids <- which(is.na(data.day.hour$Lower)) ## missing Lower ci, set to mean

  data.day.hour$Lower[ids] <- data.day.hour$Mean[ids]
  ids <- which(is.na(data.day.hour$Upper)) ## missing Upper ci, set to mean
  data.day.hour$Upper[ids] <- data.day.hour$Mean[ids]

  if (is.null(xlab[1]) | is.na(xlab[1])) xlab[1] <- "hour"

  ## proper names of labelling #####################################################################

  strip <- strip.custom(par.strip.text = list(cex = 0.8))


  if (type == "default") {
    strip.left <- FALSE
    layout <- c(length(unique(mydata$wkday)), 1)
  } else { ## two conditioning variables

    stripName <- sapply(levels(factor(data.day.hour[[type]])), function(x) quickText(x, auto.text))
    strip.left <- strip.custom(factor.levels = stripName)
    layout <- NULL
  }
  ## #################################################################################################

  temp <- paste(type, collapse = "+")
  if (type == "default") {
    myform <- formula("Mean ~ hour | wkday")
  } else {
    myform <- formula(paste("Mean ~ hour | wkday *", temp, sep = ""))
  }

  ## ylim handler
  if (ylim.handler) {
    extra.args$ylim <- rng(data.day.hour)
  }

  ## user supplied separate ylim
  if (ylimList) {
    extra.args$ylim <- ylim.list[[4]]
  }


  ## plot
  xy.args <- list(
    x = myform, data = data.day.hour, groups = data.day.hour$variable,
    as.table = TRUE,
    xlim = c(0, 23),
    xlab = xlab[1],
    layout = layout,
    par.settings = simpleTheme(col = myColors),
    between = list(x = panel.gap),
    scales = list(x = list(at = c(0, 6, 12, 18, 23))),
    key = key,
    strip = strip,
    strip.left = strip.left,
    par.strip.text = list(cex = 0.8),
    panel = function(x, y, ...) {
      panel.grid(-1, 0)
      panel.abline(v = c(0, 6, 12, 18, 23), col = "grey85")
      panel.superpose(
        x, y, ...,
        panel.groups = function(x, y, col.line, type, group.number,
                                subscripts, ...) {
          if (difference) panel.abline(h = 0, lty = 5)

          ## only plot median once if 2 conf.int
          id <- which(data.day.hour$ci[subscripts] == data.day.hour$ci[1])

          panel.xyplot(
            x[id], y[id],
            type = "l",
            col.line = myColors[group.number], ...
          )

          if (ci) {
            mkpoly(
              data.day.hour[subscripts, ],
              x = "hour",
              y = "Mean", group.number, myColors, alpha
            )
          }
          ## reference line(s)
          if (!is.null(ref.y)) do.call(panel.abline, ref.y)
        }
      )
    }
  )

  ## reset for extra.args
  xy.args <- listUpdate(xy.args, extra.args)

  ## plot
  day.hour <- do.call(xyplot, xy.args)

  subsets <- c("day.hour", "hour", "day", "month")

  ## this adjusts the space for the title to 2 lines (approx) if \n in title
  if (length(grep("atop", overall.main)) == 1) {
    y.upp <- 0.95
    y.dwn <- 0.05
  } else {
    y.upp <- 0.975
    y.dwn <- 0.025
  }

  main.plot <- function(...) {
    if (type == "default") {
      print(update(
        day.hour,
        key = list(
          rectangles = list(col = myColors[1:npol], border = NA),
          text = list(lab = mylab), space = "bottom",
          columns = key.columns,
          title = "", lines.title = 1
        )
      ), position = c(0, 0.5, 1, y.upp), more = TRUE)
    } else {
      print(update(
        useOuterStrips(day.hour, strip = strip, strip.left = strip.left),
        key = list(
          rectangles = list(col = myColors[1:npol], border = NA),
          text = list(lab = mylab), space = "bottom", columns = key.columns,
          title = "", lines.title = 1
        )
      ), position = c(0, 0.5, 1, y.upp), more = TRUE)
    }

    # Build the plot panels in different orders
    if (!month.last) {
      # The original plot orders
      print(hour, position = c(0, y.dwn, 0.33, 0.53), more = TRUE)
      print(month, position = c(0.33, y.dwn, 0.66, 0.53), more = TRUE)
      print(day, position = c(0.66, y.dwn, 1, 0.53))
    } else {
      # Move around the plot order so they follow a logical hierarchy of averaging
      # periods hour-day-month
      print(hour, position = c(0, y.dwn, 0.33, 0.53), more = TRUE)
      print(day, position = c(0.33, y.dwn, 0.66, 0.53), more = TRUE)
      print(month, position = c(0.66, y.dwn, 1, 0.53))
    }

    ## use grid to add an overall title
    grid.text(overall.main, 0.5, y.upp, gp = gpar(fontsize = 14))
    grid.text(overall.sub, 0.5, y.dwn, gp = gpar(fontsize = 12))
  }

  ind.plot <- function(x, ...) {
    plot(update(x, key = list(
      rectangles = list(col = myColors[1:npol], border = NA),
      text = list(lab = mylab), space = "top", columns = key.columns
    )), ...)
  }

  if (plot) main.plot()
  output <- list(
    plot = list(day.hour, hour, day, month, subsets = subsets),
    data = list(data.day.hour, data.hour, data.weekday, data.month, subsets = subsets),
    call = match.call(),
    main.plot = main.plot, ind.plot = ind.plot
  )
  names(output$data)[1:4] <- subsets
  names(output$plot)[1:4] <- subsets
  class(output) <- "openair"

  invisible(output)
}
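
## proc: summarise 'value' by the chosen temporal component(s) and 'type',
## returning Mean, Lower and Upper for each level of 'variable'. Wind
## direction ("wd") is treated separately using vector averaging via
## wd.smean.normal.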

proc <- function(conf.int = conf.int, mydata, vars = "day.hour", pollutant, type, B = B,
                 statistic = statistic) {

  ## get rid of R check annoyances
  variable <- value <- NULL

  if (statistic == "mean") {
    stat <- bootMean
  } else {
    stat <- median_hilow
  }

  summary.values <- function(conf.int = conf.int, mydata, vars = vars, FUN, type = type, B = B,
                             statistic = statistic) {
    if (vars == "hour") myform <- formula(paste("value ~ variable + hour +", type))

    if (vars == "day.hour") myform <- formula(paste("value ~ variable + wkday + hour +", type))

    if (vars == "wkday") myform <- formula(paste("value ~ variable + wkday +", type))

    if (vars == "mnth") myform <- formula(paste("value ~ variable + mnth +", type))

    mydata <- aggregate(
      myform,
      data = mydata, FUN, B = B, statistic = statistic,
      conf.int = conf.int
    )
    mydata
  }

  ## function to calculate statistics dealing with wd properly
  if (any(!pollutant %in% "wd")) {
    data1 <- subset(mydata, variable != "wd")
    data1 <- summary.values(
      data1, vars, stat, type,
      B = B, statistic = statistic,
      conf.int = conf.int
    )
    data1 <- data.frame(subset(data1, select = -value), data1$value)
  }

  if ("wd" %in% pollutant) {
    if (length(pollutant) > 1) mydata <- subset(mydata, variable == "wd")
    data2 <- summary.values(
      conf.int, mydata, vars, wd.smean.normal, type,
      B = B, statistic = statistic
    )
    data2 <- data.frame(subset(data2, select = -value), data2$value)
  }

  if (length(pollutant) > 1 & "wd" %in% pollutant) data2 <- bind_rows(data1, data2)

  if (!"wd" %in% pollutant) data2 <- data1

  if (length(pollutant) == 1 & "wd" %in% pollutant) data2 <- data2

  data2$ci <- conf.int
  data2
}

wd.smean.normal <- function(wd, B = B, statistic, conf.int) {
  ## function to calculate the (vector-averaged) mean wd and its confidence interval

  u <- mean(sin(pi * wd / 180), na.rm = TRUE)
  v <- mean(cos(pi * wd / 180), na.rm = TRUE)
  Mean <- as.vector(atan2(u, v) * 360 / 2 / pi)
  ids <- which(Mean < 0) ## ids where wd < 0
  Mean[ids] <- Mean[ids] + 360

  ## to calculate SD and conf int, need to know how much the angle changes from one point
  ## to the next. Also cannot be more than 180 degrees. Example change from 350 to 10 is not
  ## 340 but 20.
  wd.diff <- diff(wd)
  ids <- which(wd.diff < 0)
  wd.diff[ids] <- wd.diff[ids] + 360
  ids <- which(wd.diff > 180)
  wd.diff[ids] <- abs(wd.diff[ids] - 360)

  if (statistic == "mean") {
    intervals <- bootMean(wd.diff, B = B, conf.int)
  } else {
    intervals <- median_hilow(wd.diff, conf.int)
  }
  Lower <- intervals[2]
  names(Lower) <- NULL

  Upper <- intervals[3]
  names(Upper) <- NULL
  diff.wd <- (Upper - Lower) / 2

  c(Mean = Mean, Lower = Mean - diff.wd, Upper = Mean + diff.wd)
}
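
## errorDiff: reshape the data to wide format and bootstrap confidence
## intervals for the difference in means (poll2 - poll1) for each temporal
## split, returning results for poll1, poll2 and their difference.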

errorDiff <- function(mydata, vars = "day.hour", poll1, poll2, type, B = B,
                      conf.int = conf.int) {

  ## bootstrap mean difference confidence intervals
  ## rearrange data

  # dates could be duplicated, e.g. when the function is run over several sites


  if (anyDuplicated(mydata$date) > 0) {

    mydata$rowid <- 1:nrow(mydata)

  }

  mydata <- pivot_wider(mydata,
                        names_from = "variable",
                        values_from = "value")


  if (vars == "hour") splits <- c("hour", type)
  if (vars == "day.hour") splits <- c("hour", "wkday", type)
  if (vars == "wkday") splits <- c("wkday", type)
  if (vars == "mnth") splits <- c("mnth", type)

  ## warnings from dplyr seem harmless FIXME
  res <- mydata %>%
    group_by(across(splits)) %>%
    do(bootMeanDiff(., x = poll1, y = poll2, B = B, na.rm = TRUE))

  # make sure we keep the order correct
  res$variable <- ordered(res$variable, levels = res$variable[1:3])

  res$ci <- conf.int[1]
  res
}


## function to calculate median and lower/upper quantiles
median_hilow <- function(x, conf.int = 0.95, na.rm = TRUE, ...) {
  quant <- quantile(x, probs = c(0.5, (1 - conf.int), conf.int), na.rm = na.rm)
  names(quant) <- c("Mean", "Lower", "Upper")
  quant
}

## make poly function ########################################
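## draws the shaded confidence band(s) along the hour axis; when two
## intervals are supplied the outer (higher) quantile band is drawn more
## faintly than the inner band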
mkpoly <- function(dat, x = "hour", y = "Mean", group.number, myColors, alpha) {
  len <- length(unique(dat$ci)) ## number of confidence intervals to consider
  ci <- sort(unique(dat$ci))

  fac <- 2

  if (len == 1L) id1 <- which(dat$ci == ci[1]) ## ids of higher band

  if (len == 2L) {
    id1 <- which(dat$ci == ci[2])
    id2 <- which(dat$ci == ci[1])
    fac <- 1
  }

  poly.na(
    dat[[x]][id1], dat$Lower[id1], dat[[x]][id1], dat$Upper[id1],
    group.number, myColors,
    alpha = fac * alpha / 2
  )

  if (len == 2L) {
    poly.na(
      dat[[x]][id2], dat$Lower[id2], dat[[x]][id2], dat$Upper[id2],
      group.number, myColors,
      alpha = alpha
    )
  }
}
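
## mkrect: draws the confidence interval(s) as shaded rectangles centred on
## each discrete x value (weekday or month); when two intervals are supplied
## the outer (higher) quantile range is drawn as a narrower, fainter
## rectangle than the inner range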

mkrect <- function(dat, x = "wkday", y = "Mean", group.number, myColors, alpha) {
  len <- length(unique(dat$ci)) ## number of confidence intervals to consider
  ci <- sort(unique(dat$ci))

  fac <- 2

  if (len == 1L) id1 <- which(dat$ci == ci[1]) ## ids of higher band

  if (len == 2L) {
    id1 <- which(dat$ci == ci[2])
    id2 <- which(dat$ci == ci[1])
    fac <- 1
  }

  panel.rect(
    dat[[x]][id1] - 0.15 * fac, dat$Lower[id1], dat[[x]][id1] + 0.15 * fac,
    dat$Upper[id1],
    fill = myColors[group.number],
    border = NA, alpha = fac * alpha / 2
  )

  if (len == 2L) {
    panel.rect(
      dat[[x]][id2] - 0.3, dat$Lower[id2], dat[[x]][id2] + 0.3,
      dat$Upper[id2],
      fill = myColors[group.number],
      border = NA, alpha = alpha
    )
  }
}