Scaled median absolute deviation from the median is applied to double-differenced time series to identify outliers.
x | A data frame with column names representing required variables. See 'Details' below.
qc_flag | A character string. Specifies the column name in
var_thr | A numeric vector with 2 non-missing values. Specifies fixed thresholds for
iter | An integer value. Defines number of despiking iterations.
plot | A logical value. If
light | A character string. Selects preferred variable for incoming light intensity.
night_thr | A numeric value that defines the threshold between night (for
n | A numeric value. Number of values within 13-day blocks required to obtain robust statistics.
z | A numeric value. MAD scale factor.
c | A numeric value.
var | A character string. Specifies the variable name in
Low Frequency Data Despiking is not an additive quality control (QC) test. despike follows the QC scheme using QC flag range 0 - 2. The varnames attribute of the returned vector should be chosen to follow the 'Naming Strategy' described in extract_QC, i.e. to be distinguished by the suffix "_spikesLF".
The data frame x is expected to have certain properties. It must contain a column named "timestamp" of class "POSIXt" with a regular sequence of date-time values, typically with (half-)hourly frequency. Missing values in "timestamp" are not allowed; if no records exist for a given date-time value, that value still has to be included. x must also contain the columns required by the argument values. If QC flags are not available for var, qc_flag still has to be included in x as a named column with all values set to 0 (i.e. all values will be checked for outliers).
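As a rough illustration of these input requirements, the sketch below builds a small data frame by hand. The column names NEE and qc_NEE, the time zone and the values are assumptions made for the example only; "timestamp" is the one name the text prescribes.

```r
# Hypothetical input: one day of half-hourly records.
# Only the "timestamp" column name is required by the text; NEE and
# qc_NEE are illustrative names.
timestamp <- seq(
  from = as.POSIXct("2020-06-01 00:30:00", tz = "UTC"),
  by = "30 min",
  length.out = 48
)
set.seed(1)
x <- data.frame(
  timestamp = timestamp,
  NEE = rnorm(48),   # variable to despike (assumed name)
  qc_NEE = 0L        # no QC flags available, so all flags are set to 0
)
x$NEE[c(5, 20)] <- NA  # gaps stay as NA values; their timestamps are kept

# Checks mirroring the stated requirements
stopifnot(inherits(x$timestamp, "POSIXt"))
stopifnot(!anyNA(x$timestamp))
step <- diff(as.numeric(x$timestamp))
stopifnot(all(step == step[1]))  # regular sequence of date-time values
```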
Only non-missing var values with corresponding qc_flag values below 2 are used to detect outliers. Missing var values, or those with an assigned flag of 2 or NA, are not checked and are marked by an NA flag in the output. NA values in the despike output should therefore be considered unchecked records and interpreted as flag 0 within the 0 - 2 quality control scheme.
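The selection rule can be sketched with plain vectors. This is not the package code: the vectors are made up, and spikes stands in for a despiking result in which no checked value turned out to be an outlier.

```r
var <- c(1.2, NA, 0.8, 5.0, 1.1, 0.9)
qc_flag <- c(0, 0, 1, 2, NA, 0)

# Records eligible for the outlier check:
# non-missing value and a non-missing flag below 2
checked <- !is.na(var) & !is.na(qc_flag) & qc_flag < 2

# Hypothetical despiking output: NA marks records that were not checked
spikes <- ifelse(checked, 0L, NA_integer_)

# Interpreting NA as flag 0 within the 0 - 2 scheme
spikes_as_flags <- ifelse(is.na(spikes), 0L, spikes)
```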
var_thr is intended for excluding data clearly outside the theoretically acceptable range for the whole dataset. If var_thr is specified, var values below var_thr[1] and above var_thr[2] are marked as spikes (flag 2) in the output. Such values are then excluded when computing statistics on the double-differenced time series.
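A minimal sketch of the fixed-threshold step, assuming illustrative thresholds of -50 and 50 (these are not recommended values) and a made-up var vector:

```r
var <- c(-60, 2.1, 1.8, 75, 2.4, NA)
var_thr <- c(-50, 50)  # illustrative thresholds only

# Values below var_thr[1] or above var_thr[2]
out_of_range <- !is.na(var) & (var < var_thr[1] | var > var_thr[2])

# Flag 2 for out-of-range values, NA for missing values, 0 otherwise
flag <- rep(NA_integer_, length(var))
flag[!is.na(var)] <- 0L
flag[out_of_range] <- 2L

# Out-of-range values are dropped before any statistics are computed
var_for_stats <- var
var_for_stats[out_of_range] <- NA
```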
light and night_thr are intended to separate the data into night and day subsets with different statistical properties. NAs in x[light] are therefore not allowed, because they would prevent the subsetting. Despiking is then applied to the individual subsets and the combined QC flags are returned.
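The split-and-recombine idea might look as follows. The sketch assumes that records with light below night_thr count as night, uses a made-up threshold of 10, and replaces the real despiking with a hypothetical placeholder toy_check(); none of these come from the package itself.

```r
var <- c(0.5, 0.6, -8, -9, -8.5, 0.4, 0.55, -30)
light <- c(0, 2, 300, 500, 400, 1, 0, 600)
night_thr <- 10  # illustrative value

stopifnot(!anyNA(light))  # NAs would make the subsetting ambiguous

# Assumption: light below the threshold means night
subset_id <- ifelse(light < night_thr, "night", "day")
idx <- split(seq_along(var), subset_id)

# Hypothetical stand-in for the per-subset despiking
toy_check <- function(v) {
  ifelse(abs(v - median(v)) > 3 * mad(v), 2L, 0L)
}

# Apply the check to each subset and combine the flags
flags <- rep(NA_integer_, length(var))
for (s in names(idx)) {
  i <- idx[[s]]
  flags[i] <- toy_check(var[i])
}
```

Here the value -30 stands out only relative to the other daytime values, which is the motivation for treating the two subsets separately.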
Despiking is done within blocks of 13 consecutive days to account for the seasonality of the measured variable. Within each block, every record is compared with its neighbours and d[i] scores are produced. This is achieved by double-differencing:
d[i] = (var[i] - var[i-1]) - (var[i+1] - var[i])
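The formula can be written as a small helper. double_diff() is a hypothetical name for this sketch, and missing values are dropped first so that every remaining interior value gets a score.

```r
# d[i] = (var[i] - var[i-1]) - (var[i+1] - var[i]) for interior points
double_diff <- function(v) {
  v <- v[!is.na(v)]          # drop missing values first
  n <- length(v)
  if (n < 3) return(numeric(0))
  i <- 2:(n - 1)
  (v[i] - v[i - 1]) - (v[i + 1] - v[i])
}

d <- double_diff(c(1, 2, 10, 3, 4))
d  # -7 15 -8
```

The same scores equal the negated second-order difference, -diff(v, differences = 2), which can serve as a quick cross-check.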
To obtain the maximum number of d[i] scores, all missing var values are removed from the block before the d[i] scores are produced. var values are marked as spikes if d[i] is higher (lower) than the median of d[i] scores (M[d]) + (-) the scaled median absolute deviation:
d[i] > M[d] + (z * MAD / 0.6745)
d[i] < M[d] - (z * MAD / 0.6745)
MAD is defined as:
MAD = median(abs(d[i] - M[d]))
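A worked sketch of the threshold test on a made-up vector of d[i] scores. The value z = 7 is chosen for illustration only. Note that MAD here is the raw median absolute deviation, i.e. what R's mad() returns with constant = 1.

```r
# Made-up d[i] scores: a spike at position 5 also distorts positions 4 and 6
d <- c(0.5, -0.5, 0.3, -8.1, 15.9, -7.6, -0.5, 0, 0.5, -0.5)
z <- 7  # illustrative MAD scale factor

M_d <- median(d)
MAD <- median(abs(d - M_d))  # raw MAD, no consistency constant

upper <- M_d + z * MAD / 0.6745
lower <- M_d - z * MAD / 0.6745

candidate <- d > upper | d < lower
which(candidate)  # 4 5 6
```

Since 1 / 0.6745 is approximately 1.4826, the band half-width is close to z * mad(d) with R's default constant.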
The algorithm also tends to flag values that are neighbours of spikes. To prevent false flagging, the median and mad of var values within the given block (M[var] and mad[var], respectively) are computed. Values can be marked as spikes only if
var[i] > M[var] + (c * mad[var] / 1.4826)
or
var[i] < M[var] - (c * mad[var] / 1.4826)
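The second condition can be sketched as below. The scale factor is stored as c_fac to avoid masking R's c() function, its value of 4 is illustrative, and candidate_d is a hypothetical outcome of the d[i] test in which a spike and both of its neighbours were flagged.

```r
var <- c(1, 2, 10, 3, 4, 2.5, 3, 2)
c_fac <- 4  # illustrative scale factor (the argument c in the text)

M_var <- median(var)
mad_var <- mad(var)  # R's mad() already includes the constant 1.4826

# Dividing by 1.4826 recovers the raw median absolute deviation
band <- c_fac * mad_var / 1.4826
outside <- var > M_var + band | var < M_var - band

# Hypothetical candidates from the d[i] test: spike plus its neighbours
candidate_d <- c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

# Only candidates that also fall outside the band remain spikes
spike <- candidate_d & outside
which(spike)  # 3
```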
The number of available double-differenced var values is checked within each block against nVals. If it is equal to or below nVals, d[i] or var[i] values are checked against statistics computed from the entire dataset to ensure robustness.
The whole process is repeated iteratively if iter > 1. This way, new statistics are produced in each iteration after excluding already detected outliers, so new spikes can be identified.
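The iteration can be sketched by combining the two tests in a loop. This is a simplified illustration and not the package implementation: it ignores the 13-day blocks, the day and night subsets and the nVals fallback, and despike_once(), c_fac and the parameter values are assumptions of the example.

```r
# One despiking pass over a single block; TRUE marks a detected spike
despike_once <- function(v, z, c_fac) {
  spike <- rep(FALSE, length(v))
  ok <- which(!is.na(v))
  if (length(ok) < 3) return(spike)
  w <- v[ok]
  i <- 2:(length(w) - 1)
  d <- (w[i] - w[i - 1]) - (w[i + 1] - w[i])
  M_d <- median(d)
  MAD <- median(abs(d - M_d))
  cand <- d > M_d + z * MAD / 0.6745 | d < M_d - z * MAD / 0.6745
  band <- c_fac * mad(w) / 1.4826
  out <- w[i] > median(w) + band | w[i] < median(w) - band
  spike[ok[i]] <- cand & out
  spike
}

v <- c(1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 1.1, 0.8, 1.0, 1.2, 0.9, 1.1)
iter <- 2
flag <- rep(0L, length(v))
v_work <- v
n_found <- integer(iter)

for (k in seq_len(iter)) {
  s <- despike_once(v_work, z = 7, c_fac = 4)
  flag[s] <- 2L
  v_work[s] <- NA      # exclude detected spikes from later statistics
  n_found[k] <- sum(s)
}

n_found  # spikes found per iteration: 1 0
```

In this toy series the first pass flags the value 9.0 and the second pass, run on the remaining values, finds nothing new.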
If plot = FALSE, an integer vector with attributes "varnames" and "units". If plot = TRUE, a list with elements SD and plots. SD contains the same output as produced when plot = FALSE; plots contains a list of ggplot objects for the respective iteration, light subset and 13-day period.
Side effect: the counts of spikes detected in each iteration are printed to the console.
Plots are produced as a list of ggplot objects. They can therefore be assigned to an object and modified as needed before the actual plotting. Each plot consists of two panels. The upper one shows the double-differenced time series, the bottom one the actual var values. Grey bands mark the respective intervals within which a var value cannot be considered an outlier. The red points in the upper panel show all points that would be marked as spikes if c = 0. Only the points marked in blue (bottom panel) will be considered spikes. The spike detection tolerance (width of the grey bands) can be modified by the scale factors z (upper panel) and c (bottom panel).
Mauder, M., Cuntz, M., Drue, C., Graf, A., Rebmann, C., Schmid, H.P., Schmidt, M., Steinbrecher, R., 2013. A strategy for quality and uncertainty assessment of long-term eddy-covariance measurements. Agric. For. Meteorol. 169, 122-135. doi:10.1016/j.agrformet.2012.09.006
Papale, D., Reichstein, M., Canfora, E., Aubinet, M., Bernhofer, C., Longdoz, B., Kutsch, W., Rambal, S., Valentini, R., Vesala, T., Yakir, D., 2006. Towards a more harmonized processing of eddy covariance CO2 fluxes: algorithms and uncertainty estimation. Biogeosciences Discuss. 3, 961-992. doi:10.5194/bgd-3-961-2006