R/filter_data.r
In bigleaf: Physical and Physiological Ecosystem Properties from Eddy Covariance Data

Documented in filter.data filter.growing.season

########################
### Filter functions ###----------------------------------------------------------------------------
########################

#' Basic Eddy Covariance Data Filtering
#'
#' @description Filters time series of EC data for high-quality values and specified
#'              meteorological conditions.
#' 
#' @param data            Data.frame or matrix containing all required input variables in 
#'                        half-hourly or hourly resolution. Including year, month, day information
#' @param quality.control Should quality control be applied? Defaults to \code{TRUE}.
#' @param filter.growseas Should data be filtered for growing season? Defaults to \code{FALSE}.
#' @param filter.precip   Should precipitation filtering be applied? Defaults to \code{FALSE}.
#' @param filter.vars     Additional variables to be filtered. Vector of type character.
#' @param filter.vals.min Minimum values of the variables to be filtered. Numeric vector of 
#'                        the same length than \code{filter.vars}. Set to \code{NA} to be ignored.
#' @param filter.vals.max Maximum values of the variables to be filtered. Numeric vector of 
#'                        the same length than \code{filter.vars}. Set to \code{NA} to be ignored.
#' @param NA.as.invalid   If \code{TRUE} (the default) missing data are filtered out (applies to all variables).
#' @param vars.qc         Character vector indicating the variables for which quality filter should 
#'                        be applied. Ignored if \code{quality.control = FALSE}.
#' @param quality.ext     The extension to the variables' names that marks them as 
#'                        quality control variables. Ignored if \code{quality.control = FALSE}.                       
#' @param good.quality    Which values indicate good quality (i.e. not to be filtered) 
#'                        in the quality control (qc) variables? Ignored if \code{quality.control = FALSE}.
#' @param missing.qc.as.bad If quality control variable is \code{NA}, should the corresponding data point be
#'                          treated as bad quality? Defaults to \code{TRUE}. Ignored if \code{quality.control = FALSE}.                        
#' @param precip          Precipitation (mm time-1)
#' @param GPP             Gross primary productivity (umol m-2 s-1); Ignored if \code{filter.growseas = FALSE}.
#' @param doy             Day of year; Ignored if \code{filter.growseas = FALSE}.
#' @param year            Year; Ignored if \code{filter.growseas = FALSE}.
#' @param tGPP            GPP threshold (fraction of 95th percentile of the GPP time series).
#'                        Must be between 0 and 1. Ignored if \code{filter.growseas} is \code{FALSE}.
#' @param ws              Window size used for GPP time series smoothing. 
#'                        Ignored if \code{filter.growseas = FALSE}.
#' @param min.int         Minimum time interval in days for a given state of growing season.
#'                        Ignored if \code{filter.growseas = FALSE}.
#' @param tprecip         Precipitation threshold used to identify a precipitation event (mm). 
#'                        Ignored if \code{filter.precip = FALSE}.
#' @param precip.hours    Number of hours removed following a precipitation event (h).
#'                        Ignored if \code{filter.precip = FALSE}.
#' @param records.per.hour Number of observations per hour. I.e. 2 for half-hourly data.
#' @param filtered.data.to.NA Logical. If \code{TRUE} (the default), all variables in the input
#'                              data.frame/matrix are set to \code{NA} for the time step where ANY of the
#'                              \code{filter.vars} were beyond their acceptable range (as
#'                              determined by \code{filter.vals.min} and \code{filter.vals.max}).
#'                              If \code{FALSE}, values are not filtered, and an additional column 'valid'
#'                              is added to the data.frame/matrix, indicating if any value of a row
#'                              did (1) or did not fulfill the filter criteria (0).
#' @param constants frac2percent - conversion between fraction and percent
#' 
#' @details This routine consists of two parts:
#' 
#'          1) Quality control: All variables included in \code{vars.qc} are filtered for 
#'             good quality data. For these variables a corresponding quality variable with 
#'             the same name as the variable plus the extension as specified in \code{quality.ext}
#'             must be provided. For time steps where the value of the quality indicator is not included
#'             in the argument \code{good.quality}, i.e. the quality is not considered as 'good', 
#'             its value is set to \code{NA}.
#'             
#'          2) Meteorological filtering. Under certain conditions (e.g. low ustar), the assumptions
#'             of the EC method are not fulfilled. Further, some data analysis require certain meteorological
#'             conditions, such as periods without rainfall, or active vegetation (growing season, daytime).
#'             The filter applied in this second step serves to exclude time periods that do not fulfill the criteria
#'             specified in the arguments. More specifically, time periods where one of the variables is higher
#'             or lower than the specified thresholds (\code{filter.vals.min} and \code{filter.vals.max})
#'             are set to \code{NA} for all variables. If a threshold is set to \code{NA}, it will be ignored.
#'          
#' @return If \code{filtered.data.to.NA = TRUE} (default), the input data.frame/matrix with 
#'         observations which did not fulfill the filter criteria set to \code{NA}. 
#'         If \code{filtered.data.to.NA = FALSE}, the input data.frame/matrix with an additional 
#'         column "valid", which indicates whether all the data of a time step fulfill the 
#'         filtering criteria (1) or not (0).
#'         
#' @note The thresholds set with \code{filter.vals.min} and \code{filter.vals.max} filter all data
#'       that are smaller than ("<"), or greater than (">") the specified thresholds. That means
#'       if a variable has exactly the same value as the threshold, it will not be filtered. Likewise,
#'       \code{tprecip} filters all data that are greater than \code{tprecip}. 
#' 
#'       Variables considered of bad quality (as specified by the corresponding quality control variables)      
#'       will be set to \code{NA} by this routine. Data that do not fulfill the filtering criteria are set to
#'       \code{NA} if \code{filtered.data.to.NA = TRUE}. Note that with this option *all* variables of the same
#'       time step are set to \code{NA}. Alternatively, if \code{filtered.data.to.NA = FALSE} data are not set to \code{NA},
#'       and a new column "valid" is added to the data.frame/matrix, indicating if any value of a row
#'       did (1) or did not fulfill the filter criteria (0).
#'       
#' 
#' @examples 
#' # Example of data filtering; data are for a month within the growing season,
#' # hence growing season is not filtered.
#' # If filtered.data.to.NA=TRUE, all values of a row are set to NA if one filter
#' # variable is beyond its bounds. 
#' DE_Tha_Jun_2014_2 <- filter.data(DE_Tha_Jun_2014,quality.control=FALSE,
#'                                  vars.qc=c("Tair","precip","H","LE"),
#'                                  filter.growseas=FALSE,filter.precip=TRUE,
#'                                  filter.vars=c("Tair","PPFD","ustar"),
#'                                  filter.vals.min=c(5,200,0.2),
#'                                  filter.vals.max=c(NA,NA,NA),NA.as.invalid=TRUE,
#'                                  quality.ext="_qc",good.quality=c(0,1),
#'                                  missing.qc.as.bad=TRUE,GPP="GPP",doy="doy",
#'                                  year="year",tGPP=0.5,ws=15,min.int=5,precip="precip",
#'                                  tprecip=0.1,precip.hours=24,records.per.hour=2,
#'                                  filtered.data.to.NA=TRUE)
#'
#'  ## same, but with filtered.data.to.NA=FALSE
#'  DE_Tha_Jun_2014_3 <- filter.data(DE_Tha_Jun_2014,quality.control=FALSE,
#'                                  vars.qc=c("Tair","precip","H","LE"),
#'                                  filter.growseas=FALSE,filter.precip=TRUE,
#'                                  filter.vars=c("Tair","PPFD","ustar"),
#'                                  filter.vals.min=c(5,200,0.2),
#'                                  filter.vals.max=c(NA,NA,NA),NA.as.invalid=TRUE,
#'                                  quality.ext="_qc",good.quality=c(0,1),
#'                                  missing.qc.as.bad=TRUE,GPP="GPP",doy="doy",
#'                                  year="year",tGPP=0.5,ws=15,min.int=5,precip="precip",
#'                                  tprecip=0.1,precip.hours=24,records.per.hour=2,
#'                                  filtered.data.to.NA=FALSE)
#'                                  
#'  # note the additional column 'valid' in DE_Tha_Jun_2014_3.
#'  # To remove time steps marked as filtered out (i.e. 0 values in column 'valid'):
#'  DE_Tha_Jun_2014_3[DE_Tha_Jun_2014_3["valid"] == 0,] <- NA
#'   
#'   
#' @importFrom stats aggregate
#' @export                     
filter.data <- function(data,quality.control=TRUE,filter.growseas=FALSE,
                        filter.precip=FALSE,filter.vars=NULL,
                        filter.vals.min,filter.vals.max,NA.as.invalid=TRUE,
                        vars.qc=NULL,quality.ext="_qc",good.quality=c(0,1),
                        missing.qc.as.bad=TRUE,GPP="GPP",doy="doy",
                        year="year",tGPP=0.5,ws=15,min.int=5,precip="precip",
                        tprecip=0.01,precip.hours=24,records.per.hour=2,
                        filtered.data.to.NA=TRUE,constants=bigleaf.constants()){
  
  
  ### I) Quality control filter
  if (quality.control){
    
    if (is.null(vars.qc)){
      stop("quality.control (qc) is TRUE, but no qc variables are provided!")
    }
    
    if (any(!vars.qc %in% colnames(data))){
      
      missing_vars <- vars.qc[which(!vars.qc %in% colnames(data))]
      stop(paste("Variable ",missing_vars," is included in 'vars.qc', but does not exist in the input data!"))
      
    }
    
    vars.qc_qc <- paste0(vars.qc,quality.ext)
    if (any(!vars.qc_qc %in% colnames(data))){
      
      missing_vars_qc <- vars.qc_qc[which(!vars.qc_qc %in% colnames(data))]
      missing_vars2   <- substr(missing_vars_qc,1,nchar(missing_vars_qc) - nchar(quality.ext))
      stop(paste("Quality control for variable ",missing_vars2,"(",missing_vars_qc,") does not exist in the input data!")) 
    }
    
    ## data quality
    cat("Quality control:",fill=TRUE)
    for (var in vars.qc){
      var_qc <- paste0(var,quality.ext)
      check.input(data,var)
      check.input(data,var_qc)

      if (missing.qc.as.bad){
        data[get(paste0(var,quality.ext)) > max(good.quality) | is.na(get(paste0(var,quality.ext))),var] <- NA   # exclude bad quality data or those where qc flag is not available
        qc_invalid      <- sum(get(paste0(var,quality.ext)) > max(good.quality) | is.na(get(paste0(var,quality.ext)))) # count & report
      } else { # same, but consider missing quality flag variables as good
        data[get(paste0(var,quality.ext)) > max(good.quality) & !is.na(get(paste0(var,quality.ext))),var] <- NA
        qc_invalid      <- sum(get(paste0(var,quality.ext)) > max(good.quality) & !is.na(get(paste0(var,quality.ext))))
      }
      
      qc_invalid_perc <- round((qc_invalid/nrow(data))*constants$frac2percent,2)
      
      cat(var,": ",qc_invalid," data points (",qc_invalid_perc,"%) set to NA",fill=TRUE,sep="")
    }
  }
  
  
  #### II) Data filter
  valid <- rep(1L,nrow(data))
  
  # 1) GPP
  growseas_invalid <- numeric()
  if(filter.growseas){
    check.input(data,doy,year,GPP)
    date             <- strptime(paste0(year,"-",doy),format="%Y-%j")
    GPP_daily        <- aggregate(GPP,by=list(strftime(date)),function(x){if (sum(!is.na(x)) < 3){NA}else{mean(x,na.rm=TRUE)}})[,2]
    growing_season   <- filter.growing.season(GPP_daily,tGPP=tGPP,ws=ws,min.int=min.int)
    tsperday         <- table(as.character(date))
    growseas_invalid <- which(rep(growing_season,tsperday) == 0)
  }

  # 2) precipitation
  precip_invalid <- numeric()
  if (filter.precip){
    check.input(data,precip)
    if (NA.as.invalid){
      precip_events <- which(precip > tprecip | is.na(precip))
    } else {
      precip_events <- which(precip > tprecip)
    }
    precip_invalid <- unique(as.numeric(unlist(sapply(precip_events, function(x) x:(min(x+precip.hours*records.per.hour,nrow(data),na.rm=TRUE))))))
  }

  # 3) all other filter variables (as defined in filter.vars)
  invalids <- list(growseas_invalid,precip_invalid)
  
  if (!is.null(filter.vars)){
    for (var in filter.vars){
      v  <- which(filter.vars == var)
      vf <- v + 2
      check.input(data,var)
      if (NA.as.invalid){
        invalids[[vf]] <- which(get(var) < filter.vals.min[v] | get(var) > filter.vals.max[v] | is.na(get(var)))
      } else {
        invalids[[vf]] <- which(get(var) < filter.vals.min[v] | get(var) > filter.vals.max[v] & !is.na(get(var)))
      }
    }
  }
  
  # 4) calculate number and percentage of filtered values
  invalids_perc <- sapply(invalids, function(x) round((length(x)/nrow(data))*constants$frac2percent,2))
  
  additional_invalids <- sapply(2:length(invalids), function(x) 
    length(setdiff(invalids[[x]],unique(unlist(invalids[1:(x-1)])))))
  
  additional_invalids_perc <- round(additional_invalids/nrow(data)*constants$frac2percent,2)
  
  
  # 5) write to output
  if (filter.growseas | filter.precip | length(filter.vars) > 0){
    
    var.names <- c("growing season","precipitation",filter.vars)
    
    if (quality.control){
      cat("-------------------------------------------------------------------",fill=TRUE)
    }
      
    cat("Data filtering:",fill=TRUE)
  
    cat(length(growseas_invalid)," data points (",invalids_perc[1],"%) excluded by growing season filter",fill=TRUE,sep="")
  
    invisible(sapply(c(1:(length(invalids)-1)), function(x) cat(additional_invalids[x]," additional data points (",
                                                                additional_invalids_perc[x],"%) excluded by ",var.names[x+1],
                                                                " filter (",length(unlist(invalids[x+1]))," data points = ",
                                                                invalids_perc[x+1]," % in total)",fill=TRUE,sep="")))
  
  
    invalid        <- unique(unlist(invalids))
    valid[invalid] <- 0
  
    excl_perc <- round((length(invalid)/nrow(data))*constants$frac2percent,2)
  
    cat(length(invalid)," data points (",excl_perc,"%) excluded in total",fill=TRUE,sep="")
    cat(nrow(data) - length(invalid)," valid data points (",constants$frac2percent-excl_perc,"%) remaining.",fill=TRUE,sep="")
  
  
    # 6) return input data frame with filtered time steps set to NA or an additional 'valid' column
    if (filtered.data.to.NA){
      data_filtered <- data
      data_filtered[valid < 1,] <- NA
    } else {
      data_filtered <- data.frame(data,valid)
    }
  
  } else {
    
    data_filtered <- data
    
  }
  
  return(data_filtered)
}





#' GPP-based Growing Season Filter
#' 
#' @description Filters annual time series for growing season based on smoothed daily GPP data.
#' 
#' @param GPPd    daily GPP (any unit) 
#' @param tGPP    GPP threshold (fraction of 95th percentile of the GPP time series).
#'                Takes values between 0 and 1. 
#' @param ws      window size used for GPP time series smoothing
#' @param min.int minimum time interval in days for a given state of growing season
#' 
#' @details The basic idea behind the growing season filter is that vegetation is 
#'          considered to be active when its carbon uptake (GPP) is above a specified 
#'          threshold, which is defined relative to the peak GPP (95th percentile) 
#'          observed in the year. 
#'          The GPP-threshold is calculated as:
#'          
#'          \deqn{GPP_threshold = quantile(GPPd,0.95)*tGPP}
#'          
#'          GPPd time series are smoothed with a moving average to avoid fluctuations 
#'          in the delineation of the growing season. The window size defaults to 15 
#'          days, but depending on the ecosystem, other values can be appropriate. 
#'          
#'          The argument \code{min.int} serves to avoid short fluctuations in the 
#'          status growing season vs. no growing season by defining a minimum length
#'          of the status. If a time interval shorter than \code{min.int} is labeled
#'          as growing season or non-growing season, it is changed to the status of 
#'          the neighboring values.
#'          
#' @return a vector of type integer of the same length as the input GPPd in which 0 indicate
#'         no growing season (dormant season) and 1 indicate growing season.
#'                 
#' @importFrom stats quantile filter                                 
#' @export  
filter.growing.season <- function(GPPd,tGPP,ws=15,min.int=5){
  
  if(sum(is.na(GPPd)) < 0.5*length(GPPd)){
    
    growseas      <- rep(1,length(GPPd))
    GPP_threshold <- quantile(GPPd,probs=0.95,na.rm=TRUE)*tGPP
    
    ## smooth GPP
    GPPd_smoothed <- filter(GPPd,method="convolution",filter=rep(1/ws,ws))
    
    ## set values at the beginning and end of the time series to the mean of the original values
    wsd <- floor(ws/2)
    GPPd_smoothed[1:wsd] <- mean(GPPd[1:(2*wsd)],na.rm=TRUE)
    GPPd_smoothed[(length(GPPd)-(wsd-1)):length(GPPd)] <- mean(GPPd[(length(GPPd)-(2*wsd-1)):length(GPPd)],na.rm=TRUE)
    
    # check for occurrence of missing values and set them to mean of the values surrounding them
    missing <- which(is.na(GPPd_smoothed))
    if (length(missing) > 0){
      if (length(missing) > 10){warning("Attention, there is a gap in 'GPPd' of length n = ",length(missing))}
      replace_val <- mean(GPPd_smoothed[max(1,missing[1] - 4):min((missing[length(missing)] + 4),length(GPPd_smoothed))],na.rm=TRUE)
      GPPd_smoothed[missing] <- replace_val
    }
    
    # filter daily GPP
    growseas[GPPd_smoothed < GPP_threshold] <- 0
    
    ## change short intervals to the surrounding values to avoid 'wrong' fluctuations
    intervals <- rle(growseas)
    short_int <- which(intervals$lengths <= min.int)
    
    if (length(short_int) > 0){
      start <- numeric()
      end   <- numeric()
      
      for (i in 1:length(short_int)){
        
        start[i] <- sum(intervals$lengths[1:short_int[i]-1]) + 1
        end[i]   <- start[i]+intervals$lengths[short_int[i]] - 1
        
        val <- unique(growseas[start[i]:end[i]])
        
        if (val == 0 & growseas[start[i]-1] == 1){
          growseas[start[i]:end[i]] <- 1   
        } else if (val == 1 & growseas[start[i]-1] == 0){
          growseas[start[i]:end[i]] <- 0
        }
      }
    }
    
    growseas <- as.integer(growseas)
    
  } else {
    
    warning("number of available GPPd data is less than half the total number of days per year. Filter is not applied!")
    growseas <- as.integer(rep(1,length(GPPd)))
    
  }
  
  return(growseas)
}
Any scripts or data that you put into this service are public.
bigleaf documentation built on Aug. 22, 2022, 9:09 a.m.
rdrr.io home R language documentation Run R code online
CRAN packages Bioconductor packages R-Forge packages GitHub packages
Note that we can't provide technical support on individual packages. You should contact the package authors for that.
bigleaf
Physical and Physiological Ecosystem Properties from Eddy Covariance Data

R/filter_data.r
In bigleaf: Physical and Physiological Ecosystem Properties from Eddy Covariance Data

Defines functions filter.growing.season filter.data

Documented in filter.data filter.growing.season

Try the bigleaf package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

bigleaf Physical and Physiological Ecosystem Properties from Eddy Covariance Data

R/filter_data.r In bigleaf: Physical and Physiological Ecosystem Properties from Eddy Covariance Data

Defines functions filter.growing.season filter.data

Documented in filter.data filter.growing.season

Try the bigleaf package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

bigleaf
Physical and Physiological Ecosystem Properties from Eddy Covariance Data

R/filter_data.r
In bigleaf: Physical and Physiological Ecosystem Properties from Eddy Covariance Data