despike: Low Frequency Data Despiking


Scaled median absolute deviation from the median is applied to double-differenced time series to identify outliers.

despike(x, timestamp, qc_flag = NULL, var_thr = NULL,
  only_thr = FALSE, iter = 10, plot = FALSE, light = rg,
  night_thr = 12, n = 50, z = 5, c = 4.4478)

`x`	A data frame with column names representing required variables. See 'Details' below.
`qc_flag`	A character string. Specifies the column name in `x` that holds the quality control flags related to `var`.
`var_thr`	A numeric vector with 2 non-missing values. Specifies fixed thresholds for `var` values. Values outside this range will be flagged as spikes (flag 2). If `var_thr = NULL`, thresholds are not applied.
`iter`	An integer value. Defines the number of despiking iterations.
`plot`	A logical value. If `TRUE`, a list of `ggplot` objects visualizing the spikes is also produced.
`light`	A character string. Selects the preferred variable for incoming light intensity. `"PAR"` or `"Rg"` is allowed and can be abbreviated. If `light = NULL`, `var` values are not separated into night/day subsets and `night_thr` is not used.
`night_thr`	A numeric value that defines the threshold between night (`light` values equal to or lower than `night_thr`) and day (`light` values higher than `night_thr`) for incoming light.
`n`	A numeric value. Number of values within 13-day blocks required to obtain robust statistics.
`z`	A numeric value. MAD scale factor.
`c`	A numeric value. `mad` scale factor. Default is `3 * mad constant` (i.e. `3 * 1.4826 = 4.4478`).
`var`	A character string. Specifies the variable name in `x` with values to be despiked.

Low Frequency Data Despiking is not an additive quality control (QC) test. despike follows the QC scheme using the QC flag range 0 - 2. The varnames attribute of the returned vector should be chosen to follow the 'Naming Strategy' described in extract_QC, i.e. it should be distinguished by the suffix "_spikesLF".

The data frame x is expected to have certain properties. It must contain a column named "timestamp" of class "POSIXt" with a regular sequence of date-time values, typically with (half-)hourly frequency. Missing values in "timestamp" are not allowed. Thus, if no record exists for a given date-time value, that date-time still has to be included. x also has to contain the columns required by the chosen argument values. If QC flags are not available for var, qc_flag still has to be included in x as a named column with all values set to 0 (i.e. all values will be checked for outliers).
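As an illustration of these requirements, the following minimal sketch builds a hypothetical half-hourly input. The column names "NEE" and "qc_NEE" and the simulated values are assumptions made for the example, not names required by the package.

```r
# Hypothetical input: one day of half-hourly records (48 rows).
timestamp <- seq(
  from = as.POSIXct("2019-01-01 00:30:00", tz = "UTC"),
  to   = as.POSIXct("2019-01-02 00:00:00", tz = "UTC"),
  by   = "30 min"
)

set.seed(1)
x <- data.frame(
  timestamp = timestamp,
  NEE       = rnorm(length(timestamp)),  # illustrative variable to despike
  qc_NEE    = 0L                         # no QC flags available: all set to 0
)

# A regular sequence has a single unique time step and no missing timestamps.
is_regular <- !anyNA(x$timestamp) &&
  length(unique(diff(as.numeric(x$timestamp)))) == 1
```

Records with no measurement would stay in x with NA in the variable column rather than being dropped, so that the timestamp sequence remains regular.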

Only non-missing var values with corresponding qc_flag values below 2 are used to detect outliers. Missing var values, and those with an assigned flag of 2 or NA, are not checked and are marked by an NA flag in the output. NA values returned by despike should therefore be considered as unchecked records and interpreted as flag 0 within the 0 - 2 quality control scheme.
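The eligibility rule and the interpretation of NA flags can be sketched as follows. The vectors are invented for the example, and flags stands in for a hypothetical despike result rather than real output.

```r
var <- c(1.2, NA, 3.4, 0.8, 2.1)   # illustrative values
qc  <- c(0L, 0L, 2L, NA, 1L)       # illustrative QC flags for var

# A record is checked only if var is present and its QC flag is below 2.
checked <- !is.na(var) & !is.na(qc) & qc < 2

# Hypothetical despiking result: unchecked records carry NA.
flags <- c(0L, NA, NA, NA, 2L)

# Interpret NA (not checked) as flag 0 within the 0 - 2 scheme.
flags_filled <- flags
flags_filled[is.na(flags_filled)] <- 0L
```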

var_thr is intended for the exclusion of data clearly outside the theoretically acceptable range for the whole dataset. If var_thr is specified, var values below var_thr[1] or above var_thr[2] are marked as spikes (flag 2) in the output. Such values are then excluded when computing statistics on the double-differenced time series.
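A minimal sketch of the fixed-threshold step, using made-up values and thresholds; it only mirrors the rule described above and is not taken from the package source.

```r
var     <- c(-60, 2, 5, NA, 80, -3)  # illustrative values
var_thr <- c(-50, 50)                # illustrative acceptable range

flag <- rep(NA_integer_, length(var))

# Values outside the fixed range are marked as spikes (flag 2).
outside <- !is.na(var) & (var < var_thr[1] | var > var_thr[2])
flag[outside] <- 2L

# Only the remaining non-missing values would enter the later statistics.
keep <- !is.na(var) & !outside
```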

light and night_thr are intended to separate the data into night and day subsets with different statistical properties. Because of this subsetting, NAs in x[light] are not allowed. Despiking is then applied to the individual subsets and the combined QC flags are returned.
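The night/day separation can be illustrated with a short sketch. The light and variable values are invented, and the threshold matches the default night_thr = 12.

```r
Rg        <- c(0, 5, 12, 13, 400, 0)           # illustrative incoming light
var       <- c(1.1, 0.9, 1.0, -4.2, -6.5, 1.2) # illustrative values
night_thr <- 12

stopifnot(!anyNA(Rg))  # NAs in the light variable would break the subsetting

# Night: light equal to or lower than the threshold; day: light above it.
night <- Rg <= night_thr

night_vals <- var[night]
day_vals   <- var[!night]
```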

Despiking is done within blocks of 13 consecutive days to account for the seasonality of the measured variable. Within each block, each record is compared with its neighbours and d[i] scores are produced. This is achieved by double-differencing:

d[i] = (var[i] - var[i-1]) - (var[i+1] - var[i])
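The double-differencing formula can be sketched in base R as below. The helper name double_diff is hypothetical; missing values are dropped first, in line with the procedure described in this section.

```r
# Hypothetical helper: d[i] scores for the interior points of a series.
double_diff <- function(v) {
  v <- v[!is.na(v)]                 # drop missing values before differencing
  n <- length(v)
  if (n < 3) return(numeric(0))     # no interior points, no scores
  i <- 2:(n - 1)
  (v[i] - v[i - 1]) - (v[i + 1] - v[i])
}

d <- double_diff(c(1, 1, 10, 1, 1))
d
# -9 18 -9
```

Note that a single spike also produces large scores at both of its neighbours, and that the result equals the negated second-order difference, `-diff(v, differences = 2)`.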

To obtain the maximum number of d[i] scores, all missing var values are removed from the block before the d[i] scores are produced. var values are marked as spikes if d[i] is higher (lower) than the median of the d[i] scores (M[d]) plus (minus) the scaled median absolute deviation:

d[i] > M[d] + (z * MAD / 0.6745)

d[i] < M[d] - (z * MAD / 0.6745)

MAD is defined as:

MAD = median(abs(d[i] - M[d]))
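A minimal sketch of the MAD-based test on d[i] scores, with invented scores and the default z = 5. It assumes that MAD here is the unscaled median absolute deviation, which corresponds to `stats::mad()` with `constant = 1`.

```r
d <- c(-9, 18, -9, 0.5, -0.2, 0.1, 0.3, -0.4, 0, 0.2)  # illustrative scores
z <- 5

M_d <- median(d)
MAD <- median(abs(d - M_d))        # same as mad(d, constant = 1)

upper <- M_d + z * MAD / 0.6745
lower <- M_d - z * MAD / 0.6745

# Scores outside the band are spike candidates.
candidate <- d > upper | d < lower
which(candidate)
# 1 2 3
```

Dividing by 0.6745 is approximately the same as multiplying by 1.4826, the usual constant that makes the MAD comparable to a standard deviation for normally distributed data.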

The algorithm also tends to flag values that are neighbours of spikes. To prevent such false flagging, the median and mad of the var values within a given block (M[var] and mad[var], respectively) are computed. Values can be marked as spikes only if

var[i] > M[var] + (c * mad / 1.4826)

var[i] < M[var] - (c * mad / 1.4826)
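The additional check on the var values themselves might look like the sketch below. The data are invented, and it assumes that mad in the formulas refers to `stats::mad()` with its default constant of 1.4826; the name c_fac is used instead of c only to avoid masking the base function `c()`.

```r
var   <- c(1, 1.2, 0.9, 10, 1.1, 0.8, 1.0, 1.3, 0.7, 1.1)  # illustrative values
c_fac <- 3 * 1.4826                                          # default c = 4.4478

M_var   <- median(var)
mad_var <- mad(var)   # assumption: default constant = 1.4826

upper <- M_var + c_fac * mad_var / 1.4826
lower <- M_var - c_fac * mad_var / 1.4826

# Only values outside this band remain eligible to be marked as spikes.
outside <- var > upper | var < lower
which(outside)
# 4
```

A record would then be flagged only when both this condition and the condition on its d[i] score hold, which removes the spike's neighbours from the candidates.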

The number of available double-differenced var values is checked within each block. If it is equal to or below n, the d[i] or var[i] values are checked against statistics computed using the entire dataset to ensure robustness.
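The block-size check can be sketched as follows. How the package actually delimits its 13-day blocks is not stated here, so assigning blocks by integer division of the day index is an assumption made for illustration, as are the simulated data.

```r
n <- 50  # default threshold

# 20 days of hypothetical half-hourly records.
timestamp <- seq(as.POSIXct("2019-01-01 00:30", tz = "UTC"),
                 by = "30 min", length.out = 48 * 20)
var <- sin(seq_along(timestamp) / 10)

# Assumed block definition: consecutive 13-day windows from the first day.
day_index <- as.integer(as.Date(timestamp, tz = "UTC") -
                          as.Date(timestamp[1], tz = "UTC"))
block <- day_index %/% 13

# Make the second block sparse: keep only its first 40 values.
idx <- which(block == 1)
var[idx[-(1:40)]] <- NA

# Each run of k non-missing values yields k - 2 double-differenced scores.
n_avail <- tapply(!is.na(var), block, sum)
n_d     <- pmax(n_avail - 2, 0)

# Blocks with too few scores fall back to whole-dataset statistics.
use_global <- n_d <= n
```

In this example the first block has enough scores to use its own statistics, while the sparse second block would be compared against statistics from the entire dataset.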

The whole process is repeated iteratively if iter > 1. In each iteration, new statistics are computed after excluding the outliers already detected, so that new spikes can be identified.
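The iterative scheme can be sketched for a single block as below. This is a simplified illustration under several assumptions, not the package implementation: it ignores the 13-day blocks, the night/day subsets and the fallback to whole-dataset statistics, the function names are hypothetical, and stopping early when an iteration finds nothing new is a choice made for the sketch.

```r
# One detection pass: a point is flagged only if both its d score and its
# value lie outside the respective bands.
detect_once <- function(v, z = 5, c_fac = 4.4478) {
  n <- length(v)
  flagged <- rep(FALSE, n)
  if (n < 3) return(flagged)
  i <- 2:(n - 1)
  d <- (v[i] - v[i - 1]) - (v[i + 1] - v[i])
  d_out <- abs(d - median(d)) > z * mad(d, constant = 1) / 0.6745
  v_out <- abs(v - median(v)) > c_fac * mad(v) / 1.4826
  flagged[i] <- d_out & v_out[i]
  flagged
}

# Repeat detection, excluding spikes found in earlier iterations.
despike_sketch <- function(v, iter = 10, z = 5, c_fac = 4.4478) {
  spike <- rep(FALSE, length(v))
  for (k in seq_len(iter)) {
    idx <- which(!spike & !is.na(v))
    new <- detect_once(v[idx], z, c_fac)
    message("iteration ", k, ": ", sum(new), " new spike(s)")
    if (!any(new)) break          # assumption: stop once nothing new is found
    spike[idx[new]] <- TRUE
  }
  spike
}

v <- c(1, 1.2, 0.9, 1.1, 50, 1.0, 0.8, 1.2, 1.1, 0.9, 1.0, 1.1)
spikes <- despike_sketch(v)
which(spikes)
# 5
```

In the first pass the d scores flag the spike and both of its neighbours as candidates, but only the spike itself also fails the check on var values. The second pass, run without it, finds nothing further.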

If plot = FALSE, an integer vector with attributes "varnames" and "units". If plot = TRUE, a list with elements SD and plots. SD contains the same output as produced when plot = FALSE; plots contains a list of ggplot objects for the respective iteration, light subset and 13-day period.

Side effect: the counts of spikes detected in each iteration are printed to console.

Plots are produced as a list of ggplot objects. Thus they can be assigned to an object and modified as needed before actual plotting. Each plot consists of two panels. The upper one shows the double-differenced time series, the bottom one the actual var values. Grey bands mark the respective intervals within which a var value cannot be considered an outlier. The red points in the upper panel show all points that would be marked as spikes if c = 0. Only the points marked in blue (bottom panel) will be considered spikes. The spike detection tolerance (width of the grey bands) can be modified by the scale factors z (upper panel) and c (bottom panel).

Mauder, M., Cuntz, M., Drue, C., Graf, A., Rebmann, C., Schmid, H.P., Schmidt, M., Steinbrecher, R., 2013. A strategy for quality and uncertainty assessment of long-term eddy-covariance measurements. Agric. For. Meteorol. 169, 122-135. doi:10.1016/j.agrformet.2012.09.006

Papale, D., Reichstein, M., Canfora, E., Aubinet, M., Bernhofer, C., Longdoz, B., Kutsch, W., Rambal, S., Valentini, R., Vesala, T., Yakir, D., 2006. Towards a more harmonized processing of eddy covariance CO2 fluxes: algorithms and uncertainty estimation. Biogeosciences Discuss. 3, 961-992. doi:10.5194/bgd-3-961-2006

grahamstewart12/tidyflux documentation built on June 4, 2019, 7:44 a.m.