R/timeVariation.R

##' Diurnal, day of the week and monthly variation
##'
##' Plots the diurnal, day of the week and monthly variation for
##' different variables, typically pollutant concentrations. Four
##' separate plots are produced.
##'
##' The variation of pollutant concentrations by hour of the day and
##' day of the week etc. can reveal many interesting features that
##' relate to source types and meteorology. For traffic sources, there
##' are often important differences in the way vehicles vary by
##' vehicles type e.g. less heavy vehicles at weekends.
##'
##' The \code{timeVariation} function makes it easy to see how
##' concentrations (and many other variable types) vary by hour of the
##' day and day of the week.
##'
##' The plots also show the 95\% confidence intervals in the mean. The
##' 95\% confidence intervals in the mean are calculated through
##' bootstrap simulations, which will provide more robust estimates of
##' the confidence intervals (particularly when there are relatively
##' few data).
##'
##' The function can handle multiple pollutants and uses the flexible
##' \code{type} option to provide separate panels for each 'type' ---
##' see \code{cutData} for more details. \code{timeVariation} can also
##' accept a \code{group} option which is useful if data are
##' stacked. This will work in a similar way to having multiple
##' pollutants in separate columns.
##'
##' The user can supply their own \code{ylim} e.g. \code{ylim = c(0,
##' 200)} that will be used for all plots. \code{ylim} can also be a
##' list of length four to control the y-limits on each individual
##' plot e.g. \code{ylim = list(c(-100,500), c(200, 300), c(-400,400),
##' c(50,70))}. These pairs correspond to the hour, weekday, month and
##' day-hour plots respectively.
##'
##' The option \code{difference} will calculate the difference in
##' means of two pollutants together with bootstrap estimates of the
##' 95\% confidence intervals in the difference in the mean. This works
##' in two ways: either two pollutants are supplied in separate
##' columns e.g. \code{pollutant = c("no2", "o3")}, or there are two
##' unique values of \code{group}. The difference is calculated as the
##' second pollutant minus the first and is labelled as
##' such. Considering differences in this way can provide many useful
##' insights and is particularly useful for model evaluation when
##' information is needed about where a model differs from
##' observations by many different time scales. The manual contains
##' various examples of using \code{difference = TRUE}.
##'
##' Note also that the \code{timeVariation} function works well on a
##' subset of data and in conjunction with other plots. For example, a
##' \code{\link{polarPlot}} may highlight an interesting feature for a
##' particular wind speed/direction range. By filtering for those
##' conditions \code{timeVariation} can help determine whether the
##' temporal variation of that feature differs from other features ---
##' and help with source identification.
##'
##' In addition, \code{timeVariation} will work well with other variables if
##' available. Examples include meteorological and traffic flow data.
##'
##' Depending on the choice of statistic, a subheading is added. Users
##' can control the text in the subheading through the use of
##' \code{sub} e.g. \code{sub = ""} will remove any subheading.
##'
##' @param mydata A data frame of hourly (or higher temporal resolution data).
##'   Must include a \code{date} field and at least one variable to plot.
##' @param pollutant Name of variable to plot. Two or more pollutants can be
##'   plotted, in which case a form like \code{pollutant = c("nox", "co")}
##'   should be used.
##' @param local.tz Should the results be calculated in local time
##' that includes a treatment of daylight savings time (DST)? The
##' default is not to consider DST issues, provided the data were
##' imported without a DST offset. Emissions activity tends to occur
##' at local time e.g. rush hour is at 8 am every day. When the clocks
##' go forward in spring, the emissions are effectively released into
##' the atmosphere typically 1 hour earlier during the summertime
##' i.e. when DST applies. When plotting diurnal profiles, this has
##' the effect of \dQuote{smearing-out} the concentrations. Sometimes,
##' a useful approach is to express time as local time. This
##' correction tends to produce better-defined diurnal profiles of
##' concentration (or other variables) and allows a better comparison
##' to be made with emissions/activity data. If set to \code{FALSE}
##' then GMT is used. Examples of usage include \code{local.tz =
##' "Europe/London"}, \code{local.tz = "America/New_York"}. See
##' \code{cutData} and \code{import} for more details.
##' @param normalise Should variables be normalised? The default is
##'   \code{FALSE}. If \code{TRUE} then the variable(s) are divided by their
##'   mean values. This helps to compare the shape of the diurnal trends for
##'   variables on very different scales.
##' @param xlab x-axis label; one for each sub-plot.
##' @param name.pol Names to be given to the pollutant(s). This is useful if
##'   you want to give a fuller description of the variables, maybe also
##'   including subscripts etc.
##' @param type \code{type} determines how the data are split
##' i.e. conditioned, and then plotted. The default is will produce a
##' single plot using the entire data. Type can be one of the built-in
##' types as detailed in \code{cutData} e.g. \dQuote{season},
##' \dQuote{year}, \dQuote{weekday} and so on. For example, \code{type
##' = "season"} will produce four plots --- one for each season.
##'
##' It is also possible to choose \code{type} as another variable in the data
##'   frame. If that variable is numeric, then the data will be split into four
##'   quantiles (if possible) and labelled accordingly. If type is an existing
##'   character or factor variable, then those categories/levels will be used
##'   directly. This offers great flexibility for understanding the variation
##'   of different variables and how they depend on one another.
##'
##' Only one \code{type} is allowed in\code{timeVariation}.
##' @param group This sets the grouping variable to be used. For example, if a
##'   data frame had a column \code{site} setting \code{group = "site"} will
##'   plot all sites together in each panel. See examples below.
##' @param difference If two pollutants are chosen then setting
##' \code{difference = TRUE} will also plot the difference in means
##' between the two variables as \code{pollutant[2] -
##' pollutant[1]}. Bootstrap 95\% confidence intervals of the
##' difference in means are also calculated. A horizontal dashed line
##' is shown at y = 0.
##' @param statistic Can be \dQuote{mean} (default) or
##' \dQuote{median}. If the statistic is \sQuote{mean} then the mean
##' line and the 95\% confidence interval in the mean are plotted by
##' default. If the statistic is \sQuote{median} then the median line
##' is plotted together with the 5/95 and 25/75th quantiles are
##' plotted. Users can control the confidence intervals with
##' \code{conf.int}.
##' @param conf.int The confidence intervals to be plotted. If
##' \code{statistic = "mean"} then the confidence intervals in the
##' mean are plotted. If \code{statistic = "median"} then the
##' \code{conf.int} and \code{1 - conf.int} \emph{quantiles} are
##' plotted. \code{conf.int} can be of length 2, which is most useful
##' for showing quantiles. For example \code{conf.int = c(0.75, 0.99)}
##' will yield a plot showing the median, 25/75 and 5/95th quantiles.
##' @param B Number of bootstrap replicates to use. Can be useful to
##' reduce this value when there are a large number of observations
##' available to increase the speed of the calculations without
##' affecting the 95\% confidence interval calculations by much.
##' @param ci Should confidence intervals be shown? The default is \code{TRUE}.
##'   Setting this to \code{FALSE} can be useful if multiple pollutants are
##'   chosen where over-lapping confidence intervals can over complicate plots.
##' @param cols Colours to be used for plotting. Options include
##' \dQuote{default}, \dQuote{increment}, \dQuote{heat}, \dQuote{jet}
##' and \code{RColorBrewer} colours --- see the \code{openair}
##' \code{openColours} function for more details. For user defined the
##' user can supply a list of colour names recognised by R (type
##' \code{colours()} to see the full list). An example would be
##' \code{cols = c("yellow", "green", "blue")}
##' @param ref.y A list with details of the horizontal lines to be
##' added representing reference line(s). For example, \code{ref.y =
##' list(h = 50, lty = 5)} will add a dashed horizontal line at
##' 50. Several lines can be plotted e.g. \code{ref.y = list(h = c(50,
##' 100), lty = c(1, 5), col = c("green", "blue"))}. See
##' \code{panel.abline} in the \code{lattice} package for more details
##' on adding/controlling lines.
##' @param key By default \code{timeVariation} produces four plots on
##' one page.  While it is useful to see these plots together, it is
##' sometimes necessary just to use one for a report. If \code{key} is
##' \code{TRUE}, a key is added to all plots allowing the extraction
##' of a single plot \emph{with} key. See below for an example.
##' @param key.columns Number of columns to be used in the key. With many
##'   pollutants a single column can make to key too wide. The user can thus
##'   choose to use several columns by setting \code{columns} to be less than
##'   the number of pollutants.
##' @param start.day What day of the week should the plots start on?
##' The user can change the start day by supplying an integer between
##' 0 and 6. Sunday = 0, Monday = 1, \ldots For example to start the
##' weekday plots on a Saturday, choose \code{start.day = 6}.
##' @param auto.text Either \code{TRUE} (default) or \code{FALSE}. If
##'   \code{TRUE} titles and axis labels will automatically try and format
##'   pollutant names and units properly e.g.  by subscripting the \sQuote{2} in NO2.
##' @param alpha The alpha transparency used for plotting confidence intervals.
##'   0 is fully transparent and 1 is opaque. The default is 0.4
##' @param ... Other graphical parameters passed onto \code{lattice:xyplot}
##'   and \code{cutData}. For example, in the case of \code{cutData} the option
##'   \code{hemisphere = "southern"}.
##'
##' @import ggplot2
##' @importFrom lubridate ymd
##' @export
##' @return As well as generating the plot itself, \code{timeVariation} also
##'   returns an object of class ``openair''. The object includes three main
##'   components: \code{call}, the command used to generate the plot;
##'   \code{data}, the data used to make the four components of the plot (or
##'   subplots); and \code{plot}, the associated subplots.  If retained, e.g.
##'   using \code{output <- timeVariation(mydata, "nox")}, this output can be
##'   used to recover the data, reproduce or rework the original plot or
##'   undertake further analysis.
##'
##' An openair output can be manipulated using a number of generic operations,
##'   including \code{print}, \code{plot} and \code{summary}.
##'
##' The four components of timeVariation are: \code{day.hour}, \code{hour},
##'   \code{day} and \code{month}. Associated data.frames can be extracted
##'   directly using the \code{subset} option, e.g. as in \code{plot(object,
##'   subset = "day.hour")}, \code{summary(output, subset = "hour")}, etc, for
##'   \code{output <- timeVariation(mydata, "nox")}
##' @author David Carslaw
##' @seealso \code{\link{polarPlot}}, \code{\link{linearRelation}}
##' @keywords methods
##' @examples
##'
##'
##' # basic use
##' timeVariation(mydata, pollutant = "nox")
##'
##' # for a subset of conditions
##' \dontrun{
##' timeVariation(subset(mydata, ws > 3 & wd > 100 & wd < 270),
##' pollutant = "pm10", ylab = "pm10 (ug/m3)")
##' }
##'
##' # multiple pollutants with concentrations normalised
##' \dontrun{timeVariation(mydata, pollutant = c("nox", "co"), normalise = TRUE)}
##'
##' # show BST/GMT variation (see ?cutData for more details)
##' # the NOx plot shows the profiles are very similar when expressed in
##' # local time, showing that the profile is dominated by a local source
##' # that varies by local time and not by GMT i.e. road vehicle emissions
##'
##' \dontrun{timeVariation(mydata, pollutant = "nox", type = "dst", local.tz = "Europe/London")}
##'
##' ## In this case it is better to group the results for clarity:
##' \dontrun{timeVariation(mydata, pollutant = "nox", group = "dst", local.tz = "Europe/London")}
##'
##' # By contrast, a variable such as wind speed shows a clear shift when
##' #  expressed in local time. These two plots can help show whether the
##' #  variation is dominated by man-made influences or natural processes
##'
##' \dontrun{timeVariation(mydata, pollutant = "ws", group = "dst", local.tz = "Europe/London")}
##'
##' ## It is also possible to plot several variables and set type. For
##' ## example, consider the NOx and NO2 split by levels of O3:
##'
##' \dontrun{timeVariation(mydata, pollutant = c("nox", "no2"), type = "o3", normalise = TRUE)}
##'
##' ## difference in concentrations
##' \dontrun{timeVariation(mydata, poll= c("pm25", "pm10"), difference = TRUE)}
##'
##' # It is also useful to consider how concentrations vary by
##' # considering two different periods e.g. in intervention
##' # analysis. In the following plot NO2 has clearly increased but much
##' # less so at weekends - perhaps suggesting vehicles other than cars
##' # are important because flows of cars are approximately invariant by
##' # day of the week
##'
##' \dontrun{
##' mydata <- splitByDate(mydata, dates= "1/1/2003", labels = c("before Jan. 2003", "After Jan. 2003"))
##' timeVariation(mydata, pollutant = "no2", group = "split.by", difference = TRUE)
##' }
##'
##' ## sub plots can be extracted from the openair object
##' \dontrun{
##' myplot <- timeVariation(mydata, pollutant = "no2")
##' plot(myplot, subset = "day.hour") # top weekday and plot
##' }
##'
##' ## individual plots
##' ## plot(myplot, subset="day.hour") for the weekday and hours subplot (top)
##' ## plot(myplot, subset="hour") for the diurnal plot
##' ## plot(myplot, subset="day") for the weekday plot
##' ## plot(myplot, subset="month") for the monthly plot
##'
##' ## numerical results (mean, lower/upper uncertainties)
##' ## results(myplot, subset = "day.hour") # the weekday and hour data set
##' ## summary(myplot, subset = "hour") #summary of hour data set
##' ## head(myplot, subset = "day") #head/top of day data set
##' ## tail(myplot, subset = "month") #tail/top of month data set
##'
##' ## plot quantiles and median
##' \dontrun{
##' timeVariation(mydata, stati="median", poll="pm10", col = "firebrick")
##'
##' ## with different intervals
##' timeVariation(mydata, stati="median", poll="pm10", conf.int = c(0.75, 0.99),
##' col = "firebrick")
##' }
##'
timeVariation <- function(mydata, pollutant = "nox", local.tz = NULL,
                          normalise = FALSE, xlab = c("hour", "hour", "month",
                                                      "weekday"), name.pol = pollutant,
                          type = "default", group = NULL, difference = FALSE,
                          statistic = "mean", conf.int = 0.95, B = 100, ci = TRUE, cols = "hue",
                          ref.y = NULL, key = NULL, key.columns = 1, start.day = 1,
                          auto.text = TRUE, alpha = 0.4, ...)  {
  
  ## get rid of R check annoyances
  variable = NULL
  
  ## Args setup
  Args <- list(...)
  
  ## label controls
  ## xlab handled in formals and code because unique
  Args$ylab <- if ("ylab" %in% names(Args))
    quickText(Args$ylab, auto.text) else
      quickText(paste(pollutant, collapse=", "), auto.text)
  
  Args$main <- if ("main" %in% names(Args))
    quickText(Args$main, auto.text) else quickText("", auto.text)
  
  
  if (statistic == "median" && missing(conf.int)) conf.int <- c(0.75, 0.95)
  
  ## sub heading stat info
  if (statistic == "mean") {
    sub.text <- paste("mean and ", 100 * conf.int[1], "% confidence interval in mean",
                      sep = "")
  } else {
    if (length(conf.int) == 1L) {
      sub.text <- paste("median and ", 100 * (1 - conf.int[1]), "/", 100 * conf.int[1],
                        "th quantiles", sep = "")
    } else {
      sub.text <- paste("median, ", 100 * (1 - conf.int[1]), "/", 100 * conf.int[1],
                        " and ", 100 * (1 - conf.int[2]), "/", 100 * conf.int[2],
                        "th quantiles", sep = "")
    }
  }
  
  
  Args$sub <- if ("sub" %in% names(Args))
    quickText(Args$sub, auto.text) else quickText(sub.text, auto.text)
  
  Args$lwd <- if ("lwd" %in% names(Args)) Args$lwd else 1.5
  
  ylim.handler <- if ("ylim" %in% names(Args))
    FALSE else TRUE
  
  
  ## if user supplies separate ylims for each plot
  ylimList <- FALSE
  
  if ("ylim" %in% names(Args)) {
    if (is.list(Args$ylim)) {
      if (length(Args$ylim) != 4) stop("ylim should be a list of 4")
      ylim.list <- Args$ylim
      ylimList <- TRUE
    }
  }
  
  vars <- c("date", pollutant)
  
  #  various check to make sure all the data are available ######################################
  if (!missing(group) & length(pollutant) > 1) {
    stop("Can only have one pollutant with a grouping variable, or several pollutants and no grouping variable.")}
  
  ## only one type for now
  if (length(type) > 1) stop("Can only have one type for timeVariation.")
  
  if (type %in% pollutant) stop("Cannot have type the same as a pollutant name.")
  
  if (!missing(group)) {
    if (group %in% pollutant) stop("Cannot have group the same as a pollutant name.")
  }
  
  ## if differences between two pollutants are calculated
  if (difference) {
    if (missing(group)) {
      
      if (length(pollutant) != 2)
        stop("Need to specify two pollutants to calculate their difference.")
    }
    
    if (!missing(group)) {
      if (length(unique(mydata[ , group])) != 2)
        stop("Need to specify two pollutants to calculate their difference.")
    }
  }
  
  ## statistic check
  if (!statistic %in% c("mean", "median")) {
    statistic <- "mean"
    warning("statistic should be 'mean' or 'median',  setting it to 'mean'.")
  }
  
  ## check to see if type = "variable" (word used in code, so change)
  if (type == "variable") {
    mydata <- rename(mydata, c(variable = "tempVar"))
    type <- "tempVar"
  }
  
  ## if group is present, need to add that list of variables
  if (!missing(group)){
    
    if (group %in%  dateTypes) {
      vars <- unique(c(vars, "date"))
    } else {
      vars <- unique(c(vars, group))
    }
  }
  
  
  
  ## data checks
  mydata <- checkPrep(mydata, vars, type, remove.calm = FALSE)
  
  if (!missing(group))  mydata <- cutData(mydata, group, local.tz = local.tz, ...)
  
  mydata <- cutData(mydata, type, local.tz = local.tz, ...)
  
  # if using a type, make sure there are no NA
  if (type != "default") {
    id <- which(is.na(mydata[[type]]))
    if (length(id) > 0) mydata <- mydata[-id, ]
  }
  
  ## put in local time if needed
  if (!is.null(local.tz)) attr(mydata$date, "tzone") <- local.tz
  
  ## title for overall and individual plots
  overall.main <- Args$main
  Args$main <- ""
  overall.sub <- Args$sub
  Args$sub <- ""
  
  ## labels for pollutants, can be user-defined, special handling when difference = TRUE
  poll.orig <- pollutant
  if (difference && missing(group)) pollutant <- c(pollutant, paste(pollutant[2], "-", pollutant[1]))
  
  if (missing(name.pol)) mylab <- sapply(seq_along(pollutant), function(x)
    quickText(pollutant[x], auto.text))
  
  if (!missing(name.pol)) mylab <- sapply(seq_along(name.pol), function(x)
    quickText(name.pol[x], auto.text))
  
  if (missing(group)) {
    
    mydata <- gather(mydata, key = variable, value = value, one_of(poll.orig))
    mydata$variable <- factor(mydata$variable, levels = pollutant)  ## drop unused factor levels
    
  } else {
    ## group needs to be 'variable' and pollutant 'value'
    id <- which(names(mydata) == poll.orig)
    names(mydata)[id] <- "value"
    id <- which(names(mydata) == group)
    names(mydata)[id] <- "variable"
    
    mydata$variable <- factor(mydata$variable)  ## drop unused factor levels
    the.names <- levels(mydata[["variable"]])
    
    if (difference) 
      the.names <- c(the.names, paste(levels(mydata$variable)[2], "-",
                                      levels(mydata$variable)[1]))
    
    mylab <-  sapply(the.names, function(x) quickText(x, auto.text))
  }
  
  
  ## function to normalise
  divide.by.mean <- function(x) {
    Mean <- mean(x$Mean, na.rm = TRUE)
    x$Mean <- x$Mean / Mean
    x$Lower <- x$Lower / Mean
    x$Upper <- x$Upper / Mean
    x
  }
  
  if (normalise) Args$ylab <- "normalised level"
  
  ## get days in right order
  days <- format(ISOdate(2000, 1, 2:8), "%A")
  days.abb <- format(ISOdate(2000, 1, 2:8), "%a")
  if (start.day < 0 || start.day > 6) stop ("start.day must be between 0 and 6.")
  
  if (start.day > 0) {
    day.ord <- c(days[(1 + start.day):7], days[1:(1 + start.day - 1)])
    day.ord.abb <- c(days.abb[(1 + start.day):7], days.abb[1:(1 + start.day - 1)])
  } else {
    day.ord <- days
    day.ord.abb <- days.abb
  }
  
  ## calculate temporal components
  mydata <- within(mydata, {
    wkday <- format(date, "%A")
    wkday <- ordered(wkday, levels = day.ord)
    hour <- as.numeric(format(date, "%H"))
    mnth <- as.numeric(format(date, "%m"))}
  )
  
  
  ## y range taking account of expanded uncertainties
  rng <- function(x) {
    
    ## if no CI information, just return
    if (all(is.na(x[, c("Lower", "Upper")]))) {
      lims <- NULL
      return(lims)
    }
    
    if (ci) {
      lims <- range(c(x$Lower, x$Upper), na.rm = TRUE)
      inc <- 0.04 * abs(lims[2] - lims[1])
      lims <- c(lims[1] - inc, lims[2] + inc)
    } else {
      lims <- range(c(x$Mean, x$Mean), na.rm = TRUE)
      inc <- 0.04 * abs(lims[2] - lims[1])
      lims <- c(lims[1] - inc, lims[2] + inc)
      if (diff(lims) == 0) lims <- NULL
    }
    lims
  }
  
  npol <- length(levels(mydata$variable)) ## number of pollutants
  
  if (difference) {
    npol <- 3 ## 3 pollutants if difference considered
    if (missing(group)) poll1 <- pollutant[1]; poll2 <- pollutant[2]
    if (!missing(group)) poll1 <- levels(mydata$variable)[1]; poll2 <- levels(mydata$variable)[2]
  }
  
  ## number of columns for key
  if (missing(key.columns)) key.columns <- npol
  
  myColors <- openColours(cols, npol)
  
  ## for individual plot keys - useful if only one of the plots is extracted after printing
  if (!is.null(key)) {
    key <- list(rectangles = list(col = myColors[1:npol], border = NA), title = "",
                text = list(lab = mylab),  space = "bottom", columns = key.columns,
                lines.title = 1)
    
    Args$main <- overall.main
  }
  
  # hour calculations ---------------------------------------------
 
  
  if (difference) {
    
    data.hour <- errorDiff(mydata, vars = "hour", type = type, poll1 = poll1,
                           poll2 = poll2, B = B, conf.int = conf.int)
    
  } else {
    
    data.hour <- plyr::ldply(conf.int, proc, mydata, vars = "hour", 
                             pollutant, type, B = B,
                             statistic = statistic)
  }
  
 
  
  if (normalise) data.hour <- group_by(data.hour, variable) %>%
    do(divide.by.mean(.))
  
  if (is.null(xlab[2]) | is.na(xlab[2])) xlab[2] <- "hour"
  
  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ hour | ", temp, sep = ""))
  
  ## ylim hander
  if (ylim.handler)
    Args$ylim <- rng(data.hour)
  
  ## user supplied separate ylim
  if (ylimList)
    Args$ylim <- ylim.list[[1]]
  
  ## plot
  
  
  pltHr <- ggplot(data.hour, aes(x = hour, y = Mean, ymin = Lower, ymax = Upper, 
                                 col = variable, fill = variable)) 
  
  # plot uncertainty intervals?
  if (ci) pltHr <- pltHr +
    geom_ribbon(alpha = alpha, colour = NA) 
  
  pltHr <- pltHr +
    geom_line(size = Args$lwd) +
    scale_x_continuous(breaks = c(0, 6, 12, 18, 23), 
                       expand = c(0.02, 0),
                       minor_breaks = seq(0, 23, 3)) +
    scale_colour_manual(values = openColours(cols, npol), 
                        breaks = levels(data.hour$variable), 
                        labels = mylab) +
    scale_fill_manual(values = openColours(cols, npol), 
                      breaks = levels(data.hour$variable),
                      labels = mylab) +
    theme(legend.position = "none") +
    ylab(Args$ylab) +
    xlab(xlab[1]) +
    ylim(Args$ylim)
  
  
  # weekday -------------------------------------------------------
  
  
  if (difference) {
    data.weekday <- errorDiff(mydata, vars = "wkday", type = type, poll1 = poll1,
                              poll2 = poll2, B = B, conf.int = conf.int)
  } else {
    data.weekday <- plyr::ldply(conf.int, proc, mydata, vars = "wkday", 
                                pollutant, type, B = B,
                                statistic = statistic)
  }
  
  if (normalise) data.weekday <-  group_by(data.weekday, variable) %>%
    do(divide.by.mean(.))
  
  data.weekday$wkday <- ordered(data.weekday$wkday, levels = day.ord)
  
  data.weekday$wkday <- as.numeric(as.factor(data.weekday$wkday))
  
  if (is.null(xlab[4]) | is.na(xlab[4])) xlab[4] <- "weekday"
  
  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ wkday | ", temp, sep = ""))
  
  ## ylim hander
  if (ylim.handler)
    Args$ylim <- rng(data.weekday)
  
  ## user supplied separate ylim
  if (ylimList)
    Args$ylim <- ylim.list[[2]]
  
  ## plot
  pltWkday <- ggplot(data.weekday, aes(x = wkday, y = Mean, 
                                       xmin = wkday - 0.4, xmax = wkday + 0.4,
                                       ymin = Lower, ymax = Upper, 
                                       col = variable, fill = variable)) 
  if (ci) pltWkday <- pltWkday +
    geom_rect(alpha = alpha, colour = NA) 
  
  pltWkday <- pltWkday +
    geom_line(size = Args$lwd) +
    scale_x_continuous(breaks = 1:7, 
                       expand = c(0.02, 0),
                       minor_breaks = seq(0, 7, 1),
                       labels = day.ord.abb) +
    scale_colour_manual(values = openColours(cols, npol), 
                        breaks = levels(data.hour$variable), 
                        labels = mylab) +
    scale_fill_manual(values = openColours(cols, npol), 
                      breaks = levels(data.hour$variable),
                      labels = mylab) +
    theme(legend.position = "none") +
    ylab(Args$ylab) +
    xlab(xlab[4]) +
    ylim(Args$ylim)
  
  ## month ############################################################################
  
  if (difference) {
    data.month <- errorDiff(mydata, vars = "mnth", type = type, poll1 = poll1,
                            poll2 = poll2, B = B, conf.int = conf.int)
  } else {
    data.month <- plyr::ldply(conf.int, proc, mydata, vars = "mnth", 
                              pollutant, type, B = B,
                              statistic = statistic)
  }
  
  if (normalise) data.month <-  group_by(data.month, variable) %>%
    do(divide.by.mean(.))
  
  
  if (is.null(xlab[3]) | is.na(xlab[3])) xlab[3] <- "month"
  
  temp <- paste(type, collapse = "+")
  myform <- formula(paste("Mean ~ mnth | ", temp, sep = ""))
  
  ## ylim hander
  if (ylim.handler)
    Args$ylim <- rng(data.month)
  
  ## user supplied separate ylim
  if (ylimList)
    Args$ylim <- ylim.list[[3]]
  
  ## plot
  pltMnth <- ggplot(data.month, aes(x = mnth, y = Mean, 
                                    xmin = mnth - 0.4, xmax = mnth + 0.4,
                                    ymin = Lower, ymax = Upper, 
                                    col = variable, fill = variable)) 
  if (ci) pltMnth <- pltMnth +
    geom_rect(alpha = alpha, colour = NA) 
  
  pltMnth <- pltMnth +
    geom_line(size = Args$lwd) +
    scale_x_continuous(breaks = 1:12, 
                       expand = c(0.02, 0),
                       minor_breaks = seq(0, 12, 1),
                       labels = substr(format(seq(ymd("2000-01-01"),
                                                  ymd("2000-12-31"), "month"),
                                              "%B"), 1, 1)) +
    scale_colour_manual(values = openColours(cols, npol), 
                        breaks = levels(data.hour$variable), 
                        labels = mylab) +
    scale_fill_manual(values = openColours(cols, npol), 
                      breaks = levels(data.hour$variable),
                      labels = mylab) +
    theme(legend.position = "none") +
    ylab(Args$ylab) +
    xlab(xlab[3]) +
    ylim(Args$ylim)
  

# day and hour --------------------------------------------------

  
  if (difference) {
    data.day.hour <- errorDiff(mydata, vars = "day.hour", type = type, poll1 = poll1,
                               poll2 = poll2, B = B, conf.int = conf.int)
  } else {
    data.day.hour <- plyr::ldply(conf.int, proc, mydata, vars = "day.hour", pollutant,
                                 type, B = B, statistic = statistic)
  }
  
  if (normalise) data.day.hour <-  group_by(data.day.hour, variable) %>%
    do(divide.by.mean(.))
  
  ids <- which(is.na(data.day.hour$Lower)) ## missing Lower ci, set to mean
  
  data.day.hour$Lower[ids] <-  data.day.hour$Mean[ids]
  ids <- which(is.na(data.day.hour$Upper)) ## missing Upper ci, set to mean
  data.day.hour$Upper[ids] <-  data.day.hour$Mean[ids]
  
  if (is.null(xlab[1]) | is.na(xlab[1])) xlab[1] <- "hour"
  
  ## proper names of labelling #####################################################################
  
    if (type != "default") {
    
    stripName <- sapply(levels(factor(data.day.hour[[type]])), function(x) quickText(x, auto.text))
    
  }
  
  
  temp <- paste(type, collapse = "+")
  if (type == "default") {
    myform <- formula("Mean ~ hour | wkday")
  } else {
    myform <- formula(paste("Mean ~ hour | wkday *", temp, sep = ""))
  }
  
  ## ylim hander
  if(ylim.handler)
    Args$ylim <- rng(data.day.hour)
  
  ## user supplied separate ylim
  if (ylimList)
    Args$ylim <- ylim.list[[4]]
  
  
  ## plot
  pltDayHr <- ggplot(data.day.hour, aes(x = hour, y = Mean, ymin = Lower, ymax = Upper, 
                                        col = variable, fill = variable)) 
  
  if (ci) pltDayHr <- pltDayHr +
    geom_ribbon(alpha = alpha, colour = NA) 
  
  pltDayHr <- pltDayHr +
    geom_line(size = Args$lwd) +
    scale_x_continuous(breaks = c(0, 6, 12, 18, 23), 
                       expand = c(0.02, 0),
                       minor_breaks = seq(0, 23, 3)) +
    scale_colour_manual(name = NULL,
                        values = openColours(cols, npol), 
                        breaks = levels(data.hour$variable), 
                        labels = mylab) +
    scale_fill_manual(name = NULL,
                      values = openColours(cols, npol), 
                      breaks = levels(data.hour$variable),
                      labels = mylab) +
    theme(legend.position = "bottom", legend.text.align = 0) +
    guides(col = guide_legend(ncol = npol)) +
    facet_grid(~ wkday) +
    ylab(Args$ylab) +
    xlab(xlab[2]) +
    ylim(Args$ylim)
  
  
  subsets = c("pltDayHr", "pltHr", "pltwday", "pltmnth")
  
  # for facetting
  if (type != "default") {
    pltHr <- pltHr + facet_grid(reformulate(type), labeller = label_parsed(stripName)) 
    pltMnth <- pltMnth + facet_grid(reformulate(type)) 
    pltWkday <- pltWkday + facet_grid(reformulate(type)) 
    pltDayHr <- pltDayHr + facet_grid(paste(type, "~ wkday"))
  }
  
 
  # plot all 4 plots
  layout <- matrix(c(1, 1, 1, 2, 3, 4), nrow = 2, byrow = TRUE)
  multiplot(plotlist = list(pltDayHr, pltHr, pltMnth, pltWkday), layout = layout)
  
  # return individual plots with legends
  output <- list(plot = list(pltDayHr, 
                             pltHr + theme(legend.position = "right", legend.text.align = 0), 
                             pltWkday + theme(legend.position = "right", legend.text.align = 0), 
                             pltMnth + theme(legend.position = "right", legend.text.align = 0), 
                             subsets = subsets),
                 data = list(data.day.hour, data.hour, data.weekday, 
                             data.month, subsets = subsets),
                 call = match.call())
  
  names(output$data)[1:4] <- subsets
  names(output$plot)[1:4] <- subsets
  class(output) <- "openair"
  # 
  invisible(output)
}

proc <- function(conf.int = conf.int, mydata, vars = "day.hour", 
                 pollutant, type, B = B,
                 statistic = statistic) {
  
  ## get rid of R check annoyances
  variable = value = NULL
  
  if (statistic == "mean") {
    stat <- bootMean
  } else {
    stat <- median.hilow
  }
  
  summary.values <- function(conf.int = conf.int, mydata, vars = vars, FUN, type = type, B = B,
                             statistic = statistic) {
    
    if (vars == "hour") myform <- formula(paste("value ~ variable + hour +", type))
    
    if (vars == "day.hour") myform <- formula(paste("value ~ variable + wkday + hour +", type))
    
    if (vars == "wkday") myform <- formula(paste("value ~ variable + wkday +", type))
    
    if (vars == "mnth") myform <- formula(paste("value ~ variable + mnth +", type))
    
    mydata <- aggregate(myform, data = mydata, FUN, B = B, statistic = statistic,
                        conf.int = conf.int)
    mydata
  }
  
  ## function to calculate statistics dealing with wd properly
  if (any(!pollutant %in% "wd")) {
    
    data1 <- subset(mydata, variable != "wd")
    data1 <-  summary.values(data1, vars, stat, type, B = B, statistic = statistic,
                             conf.int = conf.int)
    data1 <- data.frame(subset(data1, select = -value), data1$value)
  }
  
  if ("wd" %in% pollutant) {
    if (length(pollutant) > 1) mydata <- subset(mydata, variable == "wd")
    data2 <-  summary.values(conf.int, mydata, vars, wd.smean.normal, type,
                             B = B, statistic = statistic)
    data2 <- data.frame(subset(data2, select = -value), data2$value)
  }
  
  if (length(pollutant) > 1 & "wd" %in% pollutant) data2 <- bind_rows(data1, data2)
  
  if (!"wd" %in% pollutant) data2 <- data1
  
  if (length(pollutant) == 1 & "wd" %in% pollutant) data2 <- data2
  
  data2$ci <- conf.int
  data2
}

wd.smean.normal <- function(wd, B = B, statistic, conf.int) {
  ## function to calculate mean and 95% CI of the mean for wd
  
  u <- mean(sin(pi * wd / 180), na.rm = TRUE)
  v <- mean(cos(pi * wd / 180), na.rm = TRUE)
  Mean <- as.vector(atan2(u, v) * 360 / 2 / pi)
  ids <- which(Mean < 0)  ## ids where wd < 0
  Mean[ids] <- Mean[ids] + 360
  
  ## to calculate SD and conf int, need to know how much the angle changes from one point
  ## to the next. Also cannot be more than 180 degrees. Example change from 350 to 10 is not
  ## 340 but 20.
  wd.diff <- diff(wd)
  ids <- which(wd.diff < 0)
  wd.diff[ids] <- wd.diff[ids] + 360
  ids <- which(wd.diff > 180)
  wd.diff[ids] <- abs(wd.diff[ids] - 360)
  
  if (statistic == "mean") {
    intervals <- bootMean(wd.diff, B = B, conf.int)
  } else {
    intervals <- median.hilow(wd.diff, conf.int)
  }
  Lower <- intervals[2]
  names(Lower) <- NULL
  
  Upper <- intervals[3]
  names(Upper) <- NULL
  diff.wd <- (Upper - Lower) / 2
  
  c(Mean = Mean, Lower = Mean - diff.wd, Upper = Mean + diff.wd)
}

errorDiff <- function(mydata, vars = "day.hour", poll1, poll2, type, B = B,
                      conf.int = conf.int) {
  
  ## bootstrap mean difference confidence intervals
  ## rearrange data
  mydata <- spread(mydata, key = variable, value = value)
  if (vars == "hour") splits <- c("hour", type)
  if (vars == "day.hour") splits <- c("hour", "wkday", type)
  if (vars == "wkday") splits <- c("wkday", type)
  if (vars == "mnth") splits <- c("mnth", type)
  
  ## warnings from dplyr seem harmless FIXME
  res <- suppressWarnings(group_by_(mydata, .dots = splits) %>%
                            do(bootMeanDiff(., x = poll1, y = poll2, B = B)))
  
  # make sure we keep the order correct
  res$variable <- ordered(res$variable, levels = res$variable[1:3])
  
  res$ci <- conf.int[1]
  res
}


## function to calculate median and lower/upper quantiles
median.hilow <- function (x, conf.int = 0.95, na.rm = TRUE, ...) {
  quant <- quantile(x, probs = c(0.5, (1 - conf.int) / 2, 
                                 (1 + conf.int) / 2), na.rm = na.rm)
  names(quant) <- c("Mean", "Lower", "Upper")
  quant
}

## make poly function ########################################
mkpoly <- function(dat, x = "hour", y = "Mean", group.number, myColors, alpha) {
  len <- length(unique(dat$ci)) ## number of confidence intervals to consider
  ci <- sort(unique(dat$ci))
  
  fac <- 2
  
  if (len == 1L) id1 <- which(dat$ci == ci[1]) ## ids of higher band
  
  if (len == 2L) {
    id1 <- which(dat$ci == ci[2])
    id2 <- which(dat$ci == ci[1])
    fac <- 1
  }
  
  poly.na(dat[[x]][id1], dat$Lower[id1], dat[[x]][id1], dat$Upper[id1],
          group.number, myColors, alpha = fac * alpha / 2)
  
  if (len == 2L) poly.na(dat[[x]][id2], dat$Lower[id2], dat[[x]][id2], dat$Upper[id2],
                         group.number, myColors, alpha = alpha)
}

mkrect <- function(dat, x = "wkday", y = "Mean", group.number, myColors, alpha) {
  len <- length(unique(dat$ci)) ## number of confidence intervals to consider
  ci <- sort(unique(dat$ci))
  
  fac <- 2
  
  if (len == 1L) id1 <- which(dat$ci == ci[1]) ## ids of higher band
  
  if (len == 2L) {
    id1 <- which(dat$ci == ci[2])
    id2 <- which(dat$ci == ci[1])
    fac <- 1
  }
  
  panel.rect(dat[[x]][id1] - 0.15 * fac, dat$Lower[id1], dat[[x]][id1] + 0.15 * fac,
             dat$Upper[id1], fill = myColors[group.number],
             border = NA, alpha = fac * alpha / 2)
  
  if (len == 2L) panel.rect(dat[[x]][id2] - 0.3, dat$Lower[id2], dat[[x]][id2] + 0.3,
                            dat$Upper[id2], fill = myColors[group.number],
                            border = NA, alpha = alpha)
}
davidcarslaw/ggopenair documentation built on May 14, 2019, 10:37 p.m.