R/timeAverage.R

##' Function to calculate time averages for data frames
##' 
##' Function to flexibly aggregate or expand data frames by different 
##' time periods, calculating vector-averaged wind direction where 
##' appropriate. The averaged periods can also take account of data 
##' capture rates.
##' 
##' This function calculates time averages for a data frame. It also 
##' treats wind direction correctly through vector-averaging. For 
##' example, the average of 350 degrees and 10 degrees is either 0 or 
##' 360 - not 180. The calculations therefore average the wind 
##' components.
##' 
##' When a data capture threshold is set through \code{data.thresh} it
##' is necessary for \code{timeAverage} to know what the original time
##' interval of the input time series is. The function will try and 
##' calculate this interval based on the most common time gap (and 
##' will print the assumed time gap to the screen). This works fine 
##' most of the time but there are occasions where it may not e.g.
##' when very few data exist in a data frame or the data are monthly
##' (i.e. non-regular time interval between months). In this case the
##' user can explicitly specify the interval through \code{interval}
##' in the same format as \code{avg.time} e.g. \code{interval =
##' "month"}. It may also be useful to set \code{start.date} and
##' \code{end.date} if the time series do not span the entire period
##' of interest. For example, if a time series ended in October and
##' annual means are required, setting \code{end.date} to the end of
##' the year will ensure that the whole period is covered and that
##' \code{data.thresh} is correctly calculated. The same also goes for
##' a time series that starts later in the year where
##' \code{start.date} should be set to the beginning of the year.
##' 
##' \code{timeAverage} should be useful in many circumstances where it
##' is necessary to work with different time average data. For 
##' example, hourly air pollution data and 15-minute meteorological 
##' data. To merge the two data sets \code{timeAverage} can be used to
##' make the meteorological data 1-hour means first. Alternatively, 
##' \code{timeAverage} can be used to expand the hourly data to 15 
##' minute data - see example below.
##' 
##' For the research community \code{timeAverage} should be useful for
##' dealing with outputs from instruments where there are a range of 
##' time periods used.
##' 
##' It is also very useful for plotting data using 
##' \code{\link{timePlot}}.  Often the data are too dense to see 
##' patterns and setting different averaging periods easily helps with
##' interpretation.
##' 
##' @param mydata A data frame containing a \code{date} field . Can be
##'   class \code{POSIXct} or \code{Date}.
##' @param avg.time This defines the time period to average to. Can be
##'   \dQuote{sec}, \dQuote{min}, \dQuote{hour}, \dQuote{day}, 
##'   \dQuote{DSTday}, \dQuote{week}, \dQuote{month}, \dQuote{quarter}
##'   or \dQuote{year}. For much increased flexibility a number can 
##'   precede these options followed by a space. For example, a 
##'   timeAverage of 2 months would be \code{period = "2 month"}. In 
##'   addition, \code{avg.time} can equal \dQuote{season}, in which
##'   case 3-month seasonal values are calculated with spring defined
##'   as March, April, May and so on.
##'   
##'   Note that \code{avg.time} can be \emph{less} than the time 
##'   interval of the original series, in which case the series is 
##'   expanded to the new time interval. This is useful, for example, 
##'   for calculating a 15-minute time series from an hourly one where
##'   an hourly value is repeated for each new 15-minute period. Note 
##'   that when expanding data in this way it is necessary to ensure 
##'   that the time interval of the original series is an exact
##'   multiple of \code{avg.time} e.g. hour to 10 minutes, day to
##'   hour. Also, the input time series must have consistent time gaps
##'   between successive intervals so that \code{timeAverage} can work
##'   out how much \sQuote{padding} to apply. To pad-out data in this
##'   way choose \code{fill = TRUE}.
##' @param data.thresh The data capture threshold to use (\%). A value
##'   of zero means that all available data will be used in a
##'   particular period regardless if of the number of values 
##'   available. Conversely, a value of 100 will mean that all data
##'   will need to be present for the average to be calculated, else
##'   it is recorded as \code{NA}. See also \code{interval},
##'   \code{start.date} and \code{end.date} to see whether it is
##'   advisable to set these other options.
##'   
##' @param statistic The statistic to apply when aggregating the data;
##'   default is the mean. Can be one of \dQuote{mean}, \dQuote{max},
##'   \dQuote{min}, \dQuote{median}, \dQuote{frequency}, \dQuote{sd},
##'   \dQuote{percentile}. Note that \dQuote{sd} is the standard
##'   deviation, \dQuote{frequency} is the number (frequency) of valid
##'   records in the period and \dQuote{data.cap} is the percentage
##'   data capture. \dQuote{percentile} is the percentile level (\%)
##'   between 0-100, which can be set using the \dQuote{percentile}
##'   option --- see below. Not used if \code{avg.time = "default"}.
##'   
##' @param type \code{type} allows \code{timeAverage} to be applied to
##'   cases where there are groups of data that need to be split and 
##'   the function applied to each group. The most common example is 
##'   data with multiple sites identified with a column representing 
##'   site name e.g. \code{type = "site"}. More generally, \code{type}
##'   should be used where the date repeats for a particular grouping 
##'   variable. However, if type is not supplied the data will still
##'   be averaged but the grouping variables (character or factor)
##'   will be dropped.
##' @param percentile The percentile level in \% used when 
##'   \code{statistic = "percentile"}. The default is 95.
##' @param start.date A string giving a start date to use. This is 
##'   sometimes useful if a time series starts between obvious 
##'   intervals. For example, for a 1-minute time series that starts 
##'   \dQuote{2009-11-29 12:07:00} that needs to be averaged up to 
##'   15-minute means, the intervals would be \dQuote{2009-11-29 
##'   12:07:00}, \dQuote{2009-11-29 12:22:00} etc. Often, however, it
##'   is better to round down to a more obvious start point e.g.
##'   \dQuote{2009-11-29 12:00:00} such that the sequence is then 
##'   \dQuote{2009-11-29 12:00:00}, \dQuote{2009-11-29 12:15:00}
##'   \ldots{} \code{start.date} is therefore used to force this type
##'   of sequence.
##' @param end.date A string giving an end date to use. This is 
##'   sometimes useful to make sure a time series extends to a known
##'   end point and is useful when \code{data.thresh} > 0 but the
##'   input time series does not extend up to the final full interval.
##'   For example, if a time series ends sometime in October but
##'   annual means are required with a data capture of >75\% then it
##'   is necessary to extend the time series up until the end of the
##'   year. Input in the format yyyy-mm-dd HH:MM.
##' @param interval The \code{timeAverage} function tries to determine
##'   the interval of the original time series (e.g. hourly) by 
##'   calculating the most common interval between time steps. The 
##'   interval is needed for calculations where the \code{data.thresh}
##'   >0. For the vast majority of regular time series this works 
##'   fine. However, for data with very poor data capture or irregular
##'   time series the automatic detection may not work. Also, for time
##'   series such as monthly time series where there is a variable 
##'   difference in time between months users should specify the time 
##'   interval explicitly e.g. \code{interval = "month"}. Users can
##'   also supply a time interval to \emph{force} on the time series.
##'   See \code{avg.time} for the format.
##'   
##'   This option can sometimes be useful with \code{start.date} and 
##'   \code{end.date} to ensure full periods are considered e.g. a
##'   full year when \code{avg.time = "year"}.
##' @param vector.ws Should vector averaging be carried out on wind 
##'   speed if available? The default is \code{FALSE} and scalar 
##'   averages are calculated. Vector averaging of the wind speed is 
##'   carried out on the u and v wind components. For example,
##'   consider the average of two hours where the wind direction and
##'   speed of the first hour is 0 degrees and 2m/s and 180 degrees
##'   and 2m/s for the second hour. The scalar average of the wind
##'   speed is simply the arithmetic average = 2m/s and the vector
##'   average is 0m/s. Vector-averaged wind speeds will always be
##'   lower than scalar-averaged values.
##' @param fill When time series are expanded i.e. when a time 
##'   interval is less than the original time series, data are 
##'   \sQuote{padded out} with \code{NA}. To \sQuote{pad-out} the 
##'   additional data with the first row in each original time
##'   interval, choose \code{fill = TRUE}.
##' @param ... Additional arguments for other functions calling 
##'   \code{timeAverage}.
##' @import dplyr
##' @export
##' @importFrom Rcpp evalCpp
##' @return Returns a data frame with date in class \code{POSIXct}.
##' @author David Carslaw
##' @seealso See \code{\link{timePlot}} that plots time series data
##'   and uses \code{timeAverage} to aggregate data where necessary.
##' @keywords methods
##' @examples
##' 
##' ## daily average values
##' daily <- timeAverage(mydata, avg.time = "day")
##' 
##' ## daily average values ensuring at least 75 % data capture
##' ## i.e. at least 18 valid hours
##' \dontrun{daily <- timeAverage(mydata, avg.time = "day", data.thresh = 75)}
##' 
##' ## 2-weekly averages
##' \dontrun{fortnight <- timeAverage(mydata, avg.time = "2 week")}
##' 
##' ## make a 15-minute time series from an hourly one
##' \dontrun{
##' min15 <-  timeAverage(mydata, avg.time = "15 min", fill = TRUE)
##' }
##' 
##' # average by grouping variable
##' \dontrun{
##' dat <- importAURN(c("kc1", "my1"), year = 2011:2013)
##' timeAverage(dat, avg.time = "year", type = "site")
##' 
##' # can also retain site code
##' timeAverage(dat, avg.time = "year", type = c("site", "code"))
##' 
##' # or just average all the data, dropping site/code
##' timeAverage(dat, avg.time = "year")
##' }
timeAverage <- function(mydata, avg.time = "day", data.thresh = 0,
                        statistic = "mean", type = "default", percentile = NA,
                        start.date = NA, end.date = NA, interval = NA,
                        vector.ws = FALSE, fill = FALSE, ...) {
  
  ## get rid of R check annoyances
  year = season = month = Uu = Vv = site = default = wd = ws = NULL
  
  ## extract variables of interest
  vars <- unique(c("date", names(mydata)))
  
  ## whether a time series has already been padded to fill time gaps
  padded <- FALSE
  
  mydata <- checkPrep(mydata, vars, type = "default", remove.calm = FALSE,
                      strip.white = FALSE)
  
  ## time zone of data
  TZ <- attr(mydata$date, "tzone")
  if (is.null(TZ)) TZ <- "GMT" ## as it is on Windows for BST
  
  if (!is.na(percentile)) {
    percentile <- percentile / 100
    if (percentile < 0 | percentile > 100) stop("Percentile range outside 0-100")
  }
  if (data.thresh < 0 | data.thresh > 100) stop("Data capture range outside 0-100")
  
  ## make data.thresh a number between 0 and 1
  data.thresh <- data.thresh / 100
  
  if (!statistic %in% c("mean", "median", "frequency", "max", "min", "sum",
                        "sd", "percentile", "data.cap"))
    stop("Statistic not recognised")
  
  if (statistic == "mean") FUN <- function (x) mean(x, na.rm = TRUE)
  if (statistic == "median") FUN <- function (x) Cquantile(x, probs = 0.50)
  if (statistic == "frequency") FUN <- function (x) length(na.omit(x))
  if (statistic == "max") FUN <- function (x) Cquantile(x, probs = 1.0)
  if (statistic == "min") FUN <- function (x) Cquantile(x, probs = 0)
  if (statistic == "sum") FUN <- function (x) sum(x, na.rm = TRUE)
  if (statistic == "sd") FUN <- function (x) sd(x, na.rm = TRUE)
  if (statistic == "data.cap") FUN <- function (x) {
    
    if (all(is.na(x))) {
      
      res <- 0
      
    } else {
      
      res <- 100 * (1 - sum(is.na(x)) / length(x))
    }
    res
  }
  
  if (statistic == "percentile") FUN <- function (x)
    Cquantile(x, probs = percentile)
  
  calc.mean <- function(mydata, start.date) { ## function to calculate means
    
    ## need to check whether avg.time is > or < actual time gap of data
    ## then data will be expanded or aggregated accordingly
    
    ## start from a particular time, if given
    if (!is.na(start.date)) {
      
      firstLine <- data.frame(date = as.POSIXct(start.date, tz = TZ))
      
      ## add in type
      firstLine[[type]] <- mydata[[type]][1]
      mydata <- bind_rows(firstLine, mydata)
      
      ## for cutting data must ensure it is in GMT because combining
      ## data frames when system is not GMT puts it in local time!...
      ## and then cut makes a string/factor levels with tz lost...
      
      mydata$date <- as.POSIXct(format(mydata$date), tz = TZ)
      
    }
    
    if (!is.na(end.date)) {
      
      lastLine <- data.frame(date = as.POSIXct(end.date, tz = TZ))
      lastLine[[type]] <- mydata[[type]][1]
      
      mydata <- bind_rows(mydata, lastLine)
      
      ## for cutting data must ensure it is in GMT because combining
      ## data frames when system is not GMT puts it in local time!...
      ## and then cut makes a string/factor levels with tz lost...
      
      mydata$date <- as.POSIXct(format(mydata$date), tz = TZ)
      
    }
    
    ## if interval specified, then use it
    if (!is.na(interval)) {
      
      mydata <- group_by_(mydata, .dots = type) %>%
        do(date.pad2(., type = type, interval = interval))
      
      ## make sure missing types are inserted
    #  mydata[[type]] <- mydata[[type]][1]
      mydata[type] <- mydata[1, type]
      
      padded <- TRUE
    }
    
    ## If interval of original time series not specified, calculate it
    ## time diff in seconds of orginal data
    timeDiff <-  as.numeric(strsplit(find.time.interval(mydata$date),
                                     " ")[[1]][1])
    
    ## time diff of new interval
    by2 <- strsplit(avg.time, " ", fixed = TRUE)[[1]]
    
    seconds <- 1
    if (length(by2) > 1) seconds <- as.numeric(by2[1])
    units <- by2[length(by2)]
    
    
    if (units == "sec") int <- 1
    if (units == "min") int <- 60
    if (units == "hour") int <- 3600
    if (units == "day") int <- 3600 * 24
    if (units == "week") int <- 3600 * 24 * 7
    if (units == "month") int <- 3600 * 24 * 31 ## approx
    if (units == "quarter" || units == "season") int <- 3600 * 24 * 31 * 3 ## approx
    if (units == "year") int <- 3600 * 8784 ## approx
    
    seconds <- seconds * int ## interval in seconds
    if (is.na(timeDiff)) timeDiff <- seconds ## when only one row
    
    ## check to see if we need to expand data rather than aggregate it
    ## i.e. chosen time interval less than that of data
    if (seconds < timeDiff) {
      
      ## orginal dates
      theDates <- mydata$date
      
      ## need to add a date to the end when expanding times
      interval <- find.time.interval(mydata$date)
      
      ## equivalent number of days, used to refine interval for month/year
      days <- as.numeric(strsplit(interval, split = " ")[[1]][1]) /
        24 / 3600
      
      ## find time interval of data
      if (class(mydata$date)[1] == "Date") {
        
        interval <- paste(days, "day")
        
      } else {
        ## this will be in seconds
        interval <- find.time.interval(mydata$date)
        
      }
      
      ## better interval, most common interval in a year
      if (days %in% c(30, 31)) interval <- "month"
      if (days %in% c(365, 366)) interval <- "year"
      
      allDates <- seq(min(mydata$date), max(mydata$date), by = interval)
      allDates <- c(allDates, max(allDates) + timeDiff)
      
      ## all data with new time interval
      allData <- data.frame(date = seq(min(allDates), max(allDates), avg.time))
      
      ## merge with orginal data, which leaves gaps to fill
      mydata <- full_join(mydata, allData, by = "date") %>%
        arrange(date)
      
      if (fill) {
        
        ## this will copy-down data to next original row of data
        ## number of additional lines to fill
        inflateFac <-  timeDiff / seconds
        if (timeDiff %% seconds != 0) 
          stop("Non-regular time expansion selected, or non-regular input time series.")
        
        ## ids of orginal dates in new dates
        ids <- which(mydata$date %in% theDates)
        
        date <- mydata$date
        mydata <-subset(mydata, select = -date)
        
        for (i in 1:(inflateFac - 1)) {
          mydata[ids + i, ] <-  mydata[ids, ]
        }
        
        mydata <- cbind(date, mydata)
        mydata <- mydata[1:nrow(mydata) - 1, ] ## don't need last row
      }
      
      return(mydata)
      
    }
    
    
    
    ## calculate wind components
    if ("wd" %in% names(mydata)) {
      if (is.numeric(mydata$wd) && "ws" %in% names(mydata)) {
        mydata <- transform(mydata,  Uu = ws * sin(2 * pi * wd / 360),
                            Vv = ws * cos(2 * pi * wd / 360))
        
      }
      
      if (is.numeric(mydata$wd) && !"ws" %in% names(mydata)) {
        mydata <- transform(mydata,  Uu = sin(2 * pi * wd / 360),
                            Vv = cos(2 * pi * wd / 360))
      }
    }
    
    if (avg.time == "season") {
      ## special case for season
      ## need to group specific months: Dec/Jan/Feb etc
      
      mydata <- cutData(mydata, type = "season")
      ## remove any missing seasons e.g. through type = "season"
      mydata <- mydata[!is.na(mydata$season), ]
      
      ## calculate year
      mydata <- transform(mydata, year = as.numeric(format(date, "%Y")),
                          month = as.numeric(format(date, "%m")))
      
      ## ids where month = 12, make December part of following year's season
      ids <- which(mydata$month == 12)
      mydata$year[ids] <- mydata$year[ids] + 1
      
      ## find mean date in year-season
      mydata <- transform(mydata, date = ave(date, list(year, season), 
                                             FUN = mean))
      
      mydata <- subset(mydata, select = -c(year, month))
      
    } 
    
    ## Aggregate data
    
    ## variables to split by
    vars <- c(type,  "date")
    
    if (avg.time == "season") vars <- c(vars, "season")
    
    if (data.thresh !=0) { ## take account of data capture
      
      ## need to make sure all data are present..
      ## print out time interval assumed for input time series
      ## useful for debugging
      if (!padded) mydata <- date.pad(mydata, type = type)
      
      if (avg.time != "season") mydata$date <- lubridate::floor_date(mydata$date, avg.time)
      
      if (statistic == "mean") {## faster for some reason?
        
        avmet <- group_by_(mydata, .dots = vars) %>%
          summarise_each(
            funs(
              if (sum(is.na(.)) / length(.) <= 1 - data.thresh)
                mean(., na.rm = TRUE)
              else NA
            )
          )
        
      } else {
        
        avmet <- group_by_(mydata, .dots = vars) %>%
          summarise_each(
            funs(
              if (sum(is.na(.)) / length(.) <= 1 - data.thresh)
                FUN(.)
              else NA
            )
          )
        
      }
      
    } else {
      
      ## faster if do not need data capture
      if (avg.time != "season")
        mydata$date <- lubridate::floor_date(mydata$date, avg.time)
      
      avmet <- #select(mydata, -date) %>%
        group_by_(mydata, .dots = vars) # %>%
      
      # This is much faster for some reason
      if (statistic == "mean") {
        avmet <- avmet %>% summarise_each(funs(mean(., na.rm = TRUE)))
        
      } else {
        
        avmet <- avmet %>% summarise_each(funs(FUN(.)))
      }
      
    }
    

    if ("wd" %in% names(mydata)) {
      
      if (is.numeric(mydata$wd)) {
        
        ## mean wd
        avmet <- transform(avmet, 
                           wd = as.vector(atan2(Uu, Vv) * 360 / 2 / pi))
        
        ## correct for negative wind directions
        ids <- which(avmet$wd < 0)  ## ids where wd < 0
        avmet$wd[ids] <- avmet$wd[ids] + 360
        
        ## vector average ws
        if ("ws" %in% names(mydata)) {
          if (vector.ws)
            avmet <- transform(avmet, ws = (Uu ^ 2 + Vv ^ 2) ^ 0.5)
        }
        
        avmet <- subset(avmet, select = c(-Uu, -Vv))
      }
    }
    
    ## fill missing gaps
    if (avg.time != "season") {
      
      avmet <- date.pad2(avmet, type = type, interval = avg.time)
    }
    
    avmet
    
  }
  
  ## cut data into intervals
  mydata <- cutData(mydata, type)
  
  ## ids of numeric columns, type and date
  ids <- c(which(names(mydata) %in% c("date", type)),
           which(sapply(mydata, function(x) !is.factor(x) && !is.character(x))))
  mydata <- mydata[, unique(ids)]
  
  
  ## some LAQN data seem to have the odd missing site name
  if ("site" %in% names(mydata)) { ## split by site
    
    ## remove any NA sites
    if (anyNA(mydata$site)) {
      id <- which(is.na(mydata$site))
      mydata <- mydata[-id, ]
    }
    
    mydata$site <- factor(mydata$site) ## removes empty factors
  }
  
  ## calculate stats split by type
  mydata <- group_by_(mydata, .dots = type) %>%
    do(calc.mean(., start.date))
  
  ## don't need default column
  if ("default" %in% names(mydata))
    mydata <- subset(mydata, select = -default)
  
  mydata
}
davidcarslaw/ggopenair documentation built on May 14, 2019, 10:37 p.m.