despikeLF: Low Frequency Data Despiking

despikeLF R Documentation

Low Frequency Data Despiking

Description

Outliers are identified by applying the scaled median absolute deviation (MAD) from the median to the double-differenced time series.

Usage

despikeLF(
  x,
  var,
  qc_flag,
  name_out = "-",
  var_thr = NULL,
  iter = 10,
  plot = FALSE,
  light = c("PAR", "GR"),
  night_thr = 10,
  nVals = 50,
  z = 7,
  c = 4.4478
)

Arguments

x

A data frame with column names representing required variables. See 'Details' below.

var

A character string. Specifies the variable name in x with values to be despiked.

qc_flag

A character string. Specifies the column name in x containing the quality control flags related to var.

name_out

A character string providing the varnames attribute value of the output.

var_thr

A numeric vector with 2 non-missing values. Specifies fixed thresholds for var values. Values outside this range will be flagged as spikes (flag 2). If var_thr = NULL, thresholds are not applied.

iter

An integer value. Defines the number of despiking iterations.

plot

A logical value. If TRUE, a list of ggplot objects visualizing the spikes is also produced.

light

A character string. Selects the preferred variable for incoming light intensity. "PAR" or "GR" is allowed. Can be abbreviated. If light = NULL, var values are not separated into night/day subsets and night_thr is not used.

night_thr

A numeric value that defines the threshold for incoming light between night (light values equal to or lower than night_thr) and day (light values higher than night_thr).

nVals

A numeric value. Number of values within 13-day blocks required to obtain robust statistics.

z

A numeric value. Scale factor applied to the MAD of the double-differenced time series (d[i] scores).

c

A numeric value. Scale factor applied to the mad of var values. Default is 3 * mad constant (i.e. 3 * 1.4826 = 4.4478).

Details

Low Frequency Data Despiking is not an additive quality control (QC) test. despikeLF follows the QC scheme using the QC flag range 0 - 2. The varnames attribute of the returned vector should be chosen to follow the 'Naming Strategy' described in extract_QC, i.e. to be distinguished by the suffix "_spikesLF".

The data frame x is expected to have certain properties. It must contain a column named "timestamp" of class "POSIXt" with a regular sequence of date-time values, typically at (half-)hourly intervals. Missing values in "timestamp" are not allowed. Thus, if no records exist for a given date-time value, that value still has to be included. x must also contain the other required columns, which depend on the argument values. If QC flags are not available for var, qc_flag still has to be included in x as a named column with all values set to 0 (i.e. all values will be checked for outliers).
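As a minimal sketch of an input that satisfies these requirements, the following base R code builds a data frame with a regular half-hourly "timestamp" column, a variable to be despiked, an all-zero QC flag column and a light column. The column names NEE and qc_NEE and all values are synthetic and chosen only for illustration. The commented call uses the arguments listed under 'Usage' and needs the openeddy package to run.

```r
# Hypothetical input: 30 days of half-hourly records (all values are synthetic).
timestamp <- seq(
  from = as.POSIXct("2020-06-01 00:00:00", tz = "UTC"),
  by = "30 min",
  length.out = 30 * 48
)
set.seed(1)
x <- data.frame(
  timestamp = timestamp,
  NEE = rnorm(length(timestamp)),  # variable to be despiked (made-up name)
  qc_NEE = 0L,                     # no QC information available: all flags 0
  PAR = pmax(0, 1500 * sin(2 * pi * (seq_along(timestamp) - 13) / 48))  # crude light proxy
)

# Checks mirroring the documented requirements on "timestamp".
has_timestamp <- inherits(x$timestamp, "POSIXt") && !anyNA(x$timestamp)
is_regular <- length(unique(diff(as.numeric(x$timestamp)))) == 1

# With openeddy installed, a call could then look like this:
# flags <- despikeLF(x, var = "NEE", qc_flag = "qc_NEE",
#                    name_out = "NEE_spikesLF", light = "PAR")
```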

Only non-missing var values with corresponding qc_flag values below 2 are used to detect outliers. Missing var values, and those with an assigned flag of 2 or NA, are not checked and are marked with an NA flag in the output. NA values in the despikeLF output should therefore be considered unchecked records and interpreted as flag 0 within the 0 - 2 quality control scheme.
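The selection rule and the interpretation of NA flags can be sketched in base R as follows. The vectors and the position of the detected spike are made up for illustration; this is not the package code.

```r
# Synthetic values and their prior QC flags.
var_vals <- c(1.2, NA, 0.8, 5.0, 1.1)
qc       <- c(0L, 0L, 2L, 1L, NA)

# Only non-missing values with a QC flag below 2 are checked.
checked <- !is.na(var_vals) & !is.na(qc) & qc < 2

# Hypothetical despiking result: unchecked records get NA,
# and the fourth record is assumed to have been detected as a spike.
flag <- ifelse(checked, 0L, NA_integer_)
flag[4] <- 2L

# Interpreting NA (not checked) as flag 0 within the 0 - 2 scheme.
flag_filled <- flag
flag_filled[is.na(flag_filled)] <- 0L
```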

var_thr is intended for the exclusion of data clearly outside the theoretically acceptable range for the whole dataset. If var_thr is specified, var values below var_thr[1] and above var_thr[2] are marked as spikes (flag 2) in the output. Such values are not used further for computing statistics on the double-differenced time series.
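A minimal base R sketch of such a fixed-threshold check is given below. The values and thresholds are invented, and the way the flagged values are dropped from later statistics is an assumption of this sketch rather than the package implementation.

```r
# Synthetic values and a hypothetical acceptable range.
var_vals <- c(-60, -3, 2, 75, NA, 10)
var_thr  <- c(-50, 50)

# Values below var_thr[1] or above var_thr[2] are out of range.
out_of_range <- !is.na(var_vals) &
  (var_vals < var_thr[1] | var_vals > var_thr[2])

# Missing values stay unchecked (NA); out-of-range values get flag 2.
flag <- rep(NA_integer_, length(var_vals))
flag[!is.na(var_vals)] <- 0L
flag[out_of_range] <- 2L

# Out-of-range values are excluded before any further statistics.
remaining <- var_vals
remaining[out_of_range] <- NA
```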

light and night_thr are intended to separate the data into night and day subsets with different statistical properties. NAs in x[light] are thus not allowed due to the subsetting. Despiking is then applied to the individual subsets and the combined QC flags are returned.
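The night/day separation can be sketched as follows. Resolving an abbreviated light name with match.arg is an assumption about how abbreviation could work, and the light values are synthetic.

```r
# An abbreviated choice such as "P" can be resolved with match.arg (assumed mechanism).
light <- match.arg("P", c("PAR", "GR"))

# Synthetic light values; NAs are not allowed because of the subsetting.
x <- data.frame(PAR = c(0, 5, 10, 11, 400, 900))
night_thr <- 10
stopifnot(!anyNA(x[[light]]))

# Night: light <= night_thr; day: light > night_thr.
is_night <- x[[light]] <= night_thr

# Row indices of each subset; despiking would be applied to each separately.
subsets <- split(seq_len(nrow(x)), ifelse(is_night, "night", "day"))
```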

Despiking is done within blocks of 13 consecutive days to account for the seasonality of the measured variable. Within each block, each record is compared with its neighbours and d[i] scores are produced. This is achieved by double-differencing:

d[i] = (var[i] - var[i-1]) - (var[i+1] - var[i])
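In base R the same score can be obtained from diff, since d[i] = 2 * var[i] - var[i-1] - var[i+1] is the negated second difference. The short series below is synthetic and serves only as a worked example.

```r
# Synthetic block with one missing value and one obvious spike.
v <- c(1, 2, NA, 1, 2, 10, 2, 1)

# Missing values are removed first to obtain as many d[i] scores as possible.
v <- v[!is.na(v)]

# d[i] = (v[i] - v[i-1]) - (v[i+1] - v[i]); undefined at both ends.
d <- c(NA, -diff(v, differences = 2), NA)
d
# NA  2 -2 -7 16 -7 NA
```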

To obtain the maximum number of d[i] scores, all missing var values are removed from the block before the d[i] scores are produced. var values are marked as spikes if d[i] is higher (lower) than the median of d[i] scores (M[d]) + (-) the scaled median absolute deviation:

d[i] > M[d] + (z * MAD / 0.6745)

d[i] < M[d] - (z * MAD / 0.6745)

MAD is defined as:

MAD = median(abs(d[i] - M[d]))
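The thresholds can be traced on a small synthetic series, as sketched below. The series is artificial (alternating 1 and 2 with a single spike of 50 at position 11) and the code is an illustration of the formulas above, not the package implementation. Note that 1 / 0.6745 is approximately 1.4826, the default constant of R's mad.

```r
# Artificial series: alternating 1, 2 with one spike at position 11.
v <- rep(c(1, 2), length.out = 21)
v[11] <- 50
z <- 7

# Double-differenced scores, undefined at both ends.
d <- c(NA, -diff(v, differences = 2), NA)

M_d <- median(d, na.rm = TRUE)             # -2
MAD <- median(abs(d - M_d), na.rm = TRUE)  # 4

upper <- M_d + z * MAD / 0.6745
lower <- M_d - z * MAD / 0.6745

# Candidates: scores outside the band around the median.
candidate <- !is.na(d) & (d > upper | d < lower)
which(candidate)
# 10 11 12
```

In this example both neighbours of the spike are selected as well, which is the behaviour that the additional check on var values is meant to correct.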

The algorithm also tends to flag values that are neighbours of spikes. To prevent false flagging, the median and mad of var values within the given block (M[var] and mad[var], respectively) are computed. Values can be marked as spikes only if

var[i] > M[var] + (c * mad[var] / 1.4826)

or

var[i] < M[var] - (c * mad[var] / 1.4826)
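A sketch of this second condition on an artificial series (alternating 1 and 2 with a spike of 50 at position 11) is shown below. The candidate positions 10 to 12 are taken as given from the double-difference test for this series, and the name c_scale stands in for the argument c to keep it visually distinct from the base function c().

```r
# Artificial series: alternating 1, 2 with one spike at position 11.
v <- rep(c(1, 2), length.out = 21)
v[11] <- 50
c_scale <- 4.4478  # default of argument c (3 * 1.4826)

# Candidates from the double-difference test for this series (taken as given).
candidate <- seq_along(v) %in% 10:12

M_var   <- median(v)  # 2
mad_var <- mad(v)     # 1.4826 (R's mad includes the constant 1.4826)

upper <- M_var + c_scale * mad_var / 1.4826
lower <- M_var - c_scale * mad_var / 1.4826

# A candidate is confirmed only if its value lies outside the band.
spike <- candidate & (v > upper | v < lower)
which(spike)
# 11
```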

The number of available double-differenced var values is checked within each block. If it is equal to or below nVals, d[i] or var[i] values are checked against the statistics computed using the entire dataset to ensure robustness.
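The block-wise count and the fallback decision might be sketched as below. How the package forms the 13-day blocks is not described here, so counting days from the first record and losing two scores per block at the edges are assumptions of this sketch.

```r
# 40 days of half-hourly timestamps (synthetic, no missing values).
timestamp <- seq(
  from = as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
  by = "30 min",
  length.out = 40 * 48
)
nVals <- 50

# Assumed block definition: consecutive 13-day windows from the first day.
day_index <- as.numeric(as.Date(timestamp, tz = "UTC")) -
  as.numeric(as.Date(timestamp[1], tz = "UTC"))
block <- day_index %/% 13

# Records per block; double-differencing loses the first and last record.
n_records <- as.vector(table(block))
n_d <- pmax(n_records - 2, 0)

# Blocks with too few scores fall back to whole-dataset statistics.
use_whole_dataset <- n_d <= nVals
```

Here the last block covers a single day and yields only 46 scores, so it would be evaluated against statistics from the entire dataset.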

The whole process is repeated iteratively if iter > 1. This way, new statistics are produced for each iteration after the exclusion of already detected outliers, so that new spikes can be identified.
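The iteration can be sketched with a simplified, single-block detector as follows. The helper detect_spikes, the argument name c_scale (standing in for c) and the early exit when no new spikes are found are assumptions of this sketch. It ignores blocks, light subsets and the nVals fallback, and the series is artificial.

```r
# Hypothetical helper: both conditions applied to one block without NAs.
detect_spikes <- function(v, z = 7, c_scale = 4.4478) {
  if (length(v) < 3) return(rep(FALSE, length(v)))
  d <- c(NA, -diff(v, differences = 2), NA)
  M_d <- median(d, na.rm = TRUE)
  MAD <- median(abs(d - M_d), na.rm = TRUE)
  by_d <- !is.na(d) & (d > M_d + z * MAD / 0.6745 | d < M_d - z * MAD / 0.6745)
  M_v <- median(v)
  width <- c_scale * mad(v) / 1.4826
  by_v <- v > M_v + width | v < M_v - width
  by_d & by_v
}

# Artificial series with two spikes.
v <- rep(c(1, 2), length.out = 41)
v[11] <- 500
v[31] <- 30

flag <- rep(0L, length(v))
iter <- 10
for (i in seq_len(iter)) {
  keep <- which(flag == 0L)                    # exclude already detected spikes
  new_spikes <- keep[detect_spikes(v[keep])]   # statistics recomputed each time
  cat("iteration", i, ":", length(new_spikes), "new spikes\n")
  if (length(new_spikes) == 0) break           # early exit is a choice of this sketch
  flag[new_spikes] <- 2L
}
```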

Value

If plot = FALSE, an integer vector with attributes "varnames" and "units". If plot = TRUE, a list with elements SD and plots. SD contains the same output as produced when plot = FALSE; plots contains a list of ggplot objects for each iteration, light subset and 13-day period.

Side effect: the counts of spikes detected in each iteration are printed to the console.

Plotting

Plots are produced as a list of ggplot objects. They can therefore be assigned to an object and modified as needed before actual plotting. Each plot consists of two panels. The upper one shows the double-differenced time series, the bottom one the actual var values. Grey bands mark the respective intervals in which a var value cannot be considered an outlier. The red points in the upper panel show all points that would be marked as spikes if c = 0. Only the points marked in blue (bottom panel) are considered spikes. The spike detection tolerance (width of the grey bands) can be modified by the scale factors z (upper panel) and c (bottom panel).

Abbreviations

  • QC: Quality Control

  • PAR: Photosynthetically Active Radiation [umol m-2 s-1]

  • GR: Global Radiation [W m-2]

References

Mauder, M., Cuntz, M., Drue, C., Graf, A., Rebmann, C., Schmid, H.P., Schmidt, M., Steinbrecher, R., 2013. A strategy for quality and uncertainty assessment of long-term eddy-covariance measurements. Agric. For. Meteorol. 169, 122-135. https://doi.org/10.1016/j.agrformet.2012.09.006

Papale, D., Reichstein, M., Canfora, E., Aubinet, M., Bernhofer, C., Longdoz, B., Kutsch, W., Rambal, S., Valentini, R., Vesala, T., Yakir, D., 2006. Towards a more harmonized processing of eddy covariance CO2 fluxes: algorithms and uncertainty estimation. Biogeosciences Discuss. 3, 961-992. https://doi.org/10.5194/bgd-3-961-2006

Sachs, L., 1996. Angewandte Statistik: Anwendung Statistischer Methoden, Springer, Berlin.

See Also

combn_QC, extract_QC, median and mad.


lsigut/openeddy documentation built on Aug. 5, 2023, 12:25 a.m.