HIC.PPFDAutoValidation.CSVfileBatchProcess: Batch process: Despike and autovalidate paired PPFD data and...


View source: R/HICFunctionsForCleaningContinuousBioParameters.R

Description

This function was designed to auto-validate paired continuous PPFD data where the two sensors are placed at a fixed distance from each other to calculate the light attenuation coefficient kd. It takes a folder containing pairs of csv files of upper and lower PPFD data and calculates the light attenuation coefficient based on those PPFD values and performs an auto-validation process on those data.

Usage

HIC.PPFDAutoValidation.CSVfileBatchProcess(
  input.directory = NULL,
  sep = ",",
  dec = ".",
  header = T,
  DataUpper = NULL,
  DataLower = NULL,
  Value,
  val.NAvalue = NULL,
  unchecked.state.of.value.code = 110,
  add.original.data = T,
  DateTime = NULL,
  datetime.format = NULL,
  datetime.timezone = "GMT",
  ConditionalMinMaxColumn = NULL,
  ConditionalMinMaxValues = NULL,
  ConditionalMin = NULL,
  ConditionalMax = NULL,
  Min = 0,
  Max = 2000,
  minmax.state.of.value.code = 91,
  sampling.interval = NULL,
  despiked.state.of.value.code = 92,
  good.state.of.value.code = 80,
  despike.threshold = 3,
  despike.Method = "median",
  precision = NULL,
  max.gap = Inf,
  DeletedSpikeOtherSensor.state.of.value.code = 94,
  NotDeletedSpikeBothSensors.state.of.value.code = 95,
  kdDespiked.state.of.value.code = 96,
  DeletedNegativeKd.state.of.value.code = 97,
  NA.state.of.value.code = 255,
  kddlupper = 1,
  kddllower = 0.25,
  Dist.Sensors = 0.4,
  UpperSensorParameterName = "PPFD1",
  LowerSensorParameterName = "PPFD"
)

Arguments

sep

Argument indicating the formatting of the input csv files. It is the field separator character, that is, the character that separates the values. By default it is a comma ",".

dec

Argument indicating the formatting of the input csv files. It is the character used for decimal points. By default it is a period ".".

header

Argument indicating the formatting of the input csv files. It is a logical value indicating whether the first line contains the column titles. By default it is TRUE.

DataUpper

A dataframe object for the upper sensor. If you only wish to process one data frame, then it can be entered directly from the R environment with this argument. If you enter in an input.directory then 'DataUpper = ' will be ignored and the files from the input directory will be processed.

DataLower

A dataframe object for the lower sensor. If you only wish to process one data frame, then it can be entered directly from the R environment with this argument. If you enter in an input.directory then 'DataLower = ' will be ignored and the files from the input directory will be processed.

Value

If the data is from a csv file or a dataframe, it is a quoted character string indicating the column name or an integer indicating the column number of the column containing the data values that you wish to despike. Data may also be entered as a single vector object (unquoted) such as ‘Value = mydata$values’ or ‘Value = values’

val.NAvalue

The value indicating an NA value in your input data. If this value is NA, then this argument can be omitted.

unchecked.state.of.value.code

Number indicating that a given value is unchecked. By default 110.

add.original.data

A logical value indicating if the original input data should be included in the output tables. Note that if you input a csv file, then every column in that file will be kept. TRUE by default.

DateTime

If the data is from a csv file, it is a quoted character string indicating the column name or an integer indicating the column number of the column containing the datetime values of the samples. Data may also be entered as a single vector object (unquoted) such as ‘DateTime = mydata$time’ or ‘DateTime = time’

datetime.format

Character string giving the datetime format. See the strptime() help file for additional help.

datetime.timezone

Character string giving the time zone of the datetime. By default “GMT”. Use OlsonNames() for a list of all time zones.

ConditionalMinMaxColumn

The column name in quotes or column number or vector object that contains the factor variable to base your conditional min max filter on

ConditionalMinMaxValues

A vector containing the factor values to base the conditional min max filter on

ConditionalMin

A vector containing the conditional minimums that correspond to the respective values in ConditionalMinMaxValues.

ConditionalMax

A vector containing the conditional maximums that correspond to the respective values in ConditionalMinMaxValues.

Min

Number giving the minimum reasonable value. All values below this will be deleted.

Max

Number giving the maximum reasonable value. All values above this will be deleted.

minmax.state.of.value.code

Number indicating that the value has been deleted during the min max filter. By default 91.

sampling.interval

As numeric, the time between samples. If you enter NULL then it will calculate it for you. By default NULL.

despiked.state.of.value.code

Number indicating that a given value was deleted during the despiking. By default 92.

good.state.of.value.code

Number indicating that a given value has been checked and deemed not a spike during the despiking. By default 80.

despike.threshold

Number indicating the threshold for defining a spike. By default it is 3, which corresponds to 3 median absolute deviations or 3 standard deviations.

despike.Method

Character string "median" or “mean” indicating the method to use for the despiking. By default “median”.

precision

A number indicating the precision of the input values. Interpolated values will be rounded to this precision. If left as NULL then the numbers will be rounded to the largest decimal length found in the data.

max.gap

As numeric, the time span of the maximum data gap you wish to interpolate.

DeletedSpikeOtherSensor.state.of.value.code

State of value code given to values deleted because they were deleted in the other sensor.

NotDeletedSpikeBothSensors.state.of.value.code

State of value code given to values that were spikes in both sensors and thus not deleted.

kdDespiked.state.of.value.code

State of value code given to values that were deleted because they were spikes in the light attenuation coefficient kd.

DeletedNegativeKd.state.of.value.code

State of value code given to values that were deleted because the light attenuation coefficient kd was negative.

NA.state.of.value.code

State of value code given to missing data.

kddlupper

Minimum PPFD value of the upper sensor for calculating light attenuation coefficient kd.

kddllower

Minimum PPFD value of the lower sensor for calculating light attenuation coefficient kd.

Dist.Sensors

Distance between PPFD sensors for calculating light attenuation coefficient kd.

UpperSensorParameterName

Upper sensor parameter name in the input file name.

LowerSensorParameterName

Lower sensor parameter name in the input file name.

input.directory

Character string of the path to the folder containing all the csv files that you wish to batch process. This argument may be omitted if you are entering vectors directly into the ‘Value’ and ‘DateTime’ arguments.

Details

All the algorithms in this function are the same as in the function dspk.DespikingWorkflow.CSVfileBatchProcess(). See the documentation for that function for more details.

This function can either batch process paired files in the directory given in the argument input.directory, or process two R objects given in the arguments DataUpper and DataLower. Specify the input directory in quotes, using forward slashes (/) or double backslashes (\\), not single backslashes (\). A directory path copied from Windows has single backslashes (\), which must be changed to forward slashes (/) or double backslashes (\\).
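As a small illustration of the path rule above, the following sketch converts a Windows-style path to forward slashes in base R. The path is the hypothetical one used in the Examples section, and the variable name win.path is introduced here only for illustration.

```r
# Hypothetical path as copied from Windows; the backslashes are doubled
# so that R can parse the string literal.
win.path <- "C:\\Rdata\\PPFDdata"

# Replace every single backslash with a forward slash.
input.directory <- gsub("\\", "/", win.path, fixed = TRUE)

print(input.directory)
```

The resulting string can then be passed to the input.directory argument.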

The input directory must be a folder containing the csv files of PPFD data, with the upper and lower sensor files of each station together. Each file name must contain a station name that is unique to its pair of files and a parameter ID that is consistent for the upper sensor and for the lower sensor. For example, the directory may contain station1_PPFD1.csv, station1_PPFD.csv, station2_PPFD1.csv and station2_PPFD.csv, where there are two stations (station1 and station2), an upper sensor ID (PPFD1) and a lower sensor ID (PPFD). You must give ‘UpperSensorParameterName’ and ‘LowerSensorParameterName’ so that the function knows which files belong to which sensor. This is not a generic function: it assumes that the components of the csv file names are separated by an underscore (_) and that, when the files are sorted alphabetically, the upper and lower sensor files of each pair are next to each other.
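To check a directory listing against this naming convention before running the batch process, a sketch such as the following may help. It is not the package's own pairing logic (which, as described above, relies on alphabetical order); it assumes file names of the form station_parameter.csv, and the file names and helper variables are hypothetical.

```r
# Hypothetical file names following the station_parameter.csv convention.
files <- c("station2_PPFD.csv", "station1_PPFD1.csv",
           "station1_PPFD.csv", "station2_PPFD1.csv")
upper.name <- "PPFD1"  # value given to UpperSensorParameterName
lower.name <- "PPFD"   # value given to LowerSensorParameterName

# Split each name into its station and parameter components.
base <- sub("\\.csv$", "", files)
parts <- strsplit(base, "_", fixed = TRUE)
station <- vapply(parts, function(p) p[1], character(1))
parameter <- vapply(parts, function(p) p[2], character(1))

# Compare the whole parameter component so "PPFD" does not also match "PPFD1".
upper <- setNames(files[parameter == upper.name], station[parameter == upper.name])
lower <- setNames(files[parameter == lower.name], station[parameter == lower.name])

# Keep only stations that have both an upper and a lower file.
stations <- sort(intersect(names(upper), names(lower)))
pairs <- data.frame(station = stations,
                    upper = unname(upper[stations]),
                    lower = unname(lower[stations]),
                    stringsAsFactors = FALSE)
print(pairs)
```

Stations missing one of the two files do not appear in pairs, which makes incomplete pairs easy to spot.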

Please see the FunctionLogFile.txt that was generated to see any error messages and details about the selected preferences and calculated preferences.

Default state of value codes

255 missing data
110 unchecked data
80 good data
91 deleted, min max filter
92 deleted, despiked
94 deleted, spike in other sensor
95 not deleted, spike in both sensors
96 deleted, spike in kd
97 deleted, negative kd

General work flow

1. Preprocess-formatting.
2. Min max filter on PPFD.
3. Despike PPFD data.
4. All values deleted in one sensor dataset must be deleted in the other as well.
5. Restore original values where spikes are in both sensor datasets.
6. Data gap interpolation PPFD (default max gap to interpolate 1 hour).
7. Calculate light attenuation coefficient kd.
8. Delete PPFD values where kd is negative.
9. Remove kd values where PPFD is below detection limit.
10. Despike kd and delete those spikes from both the PPFD data and the kd data.
11. Data gap interpolation PPFD again.
12. Calculate light attenuation kd again.
13. Delete kd values where PPFD is below detection limit.

Detailed work flow overview

All data is saved to a folder ‘autoPPFDdespikeYYYYMMDDHHMMSS’ in your working directory. Formatted data is saved to the subfolder ‘preprocFormat’. The columns ‘dspk.Values’, ’dspk.DateTimeNum’, and ’dspk.StateOfValue’ are created to not overwrite the original data. The original data is saved into the new column 'orig.values' for ease of later reference. Unchecked data is given the state of value code 110 (default) and missing data is coded as 255 (default).

PPFD data is first min/max filtered removing unreasonably high and low values (defaults are min 0 corresponding to no light and max 2000 corresponding to full sun). Deleted values are coded as 91 (default).
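The min/max step can be pictured with the following base R sketch. It is an illustration under assumptions, not the package's code: the data frame is made up, and only the column names, the default limits (0 and 2000) and the default codes (110, 91, 255) come from this page.

```r
# Made-up PPFD series with unchecked (110) and missing (255) entries.
ppfd <- data.frame(
  dspk.Values = c(-5, 0, 350, 1999, 2500, NA),
  dspk.StateOfValue = c(110, 110, 110, 110, 110, 255)
)
Min <- 0
Max <- 2000

# Flag non-missing values outside the plausible range.
out.of.range <- !is.na(ppfd$dspk.Values) &
  (ppfd$dspk.Values < Min | ppfd$dspk.Values > Max)

# Delete them and record the min/max state of value code.
ppfd$dspk.Values[out.of.range] <- NA
ppfd$dspk.StateOfValue[out.of.range] <- 91

print(ppfd)
```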

PPFD data is then despiked. Checked values coded as 80 (default), deleted as 92 (default) and unchecked values are left with their original state of value code. See documentation for dspk.DespikingWorkflow.CSVfileBatchProcess() for details on the despiking algorithm.
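The actual despiking algorithm is documented with dspk.DespikingWorkflow.CSVfileBatchProcess() and is not reproduced here. Purely to illustrate what a threshold of 3 median absolute deviations means, the sketch below flags values that lie more than despike.threshold MADs from the median of a made-up series. Using a single median over the whole series is a simplifying assumption and may differ from the package's method.

```r
# Made-up series with one obvious spike.
values <- c(100, 102, 98, 101, 500, 99, 103, 97)
despike.threshold <- 3

# Robust center and spread of the series.
center <- median(values)
spread <- mad(values)  # median absolute deviation, scaled by 1.4826 by default

# A value is treated as a spike if it is further than threshold * MAD from the median.
is.spike <- abs(values - center) > despike.threshold * spread

# Checked values get code 80, deleted spikes get code 92 (defaults from this page).
state.of.value <- ifelse(is.spike, 92, 80)
despiked <- ifelse(is.spike, NA, values)

print(data.frame(values, despiked, state.of.value))
```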

The two datasets are merged by joining on time. The upper sensor's columns have the suffix .x and the lower sensor's columns have the suffix .y.
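The .x and .y suffixes are what base R's merge() assigns by default to same-named columns. A minimal sketch with made-up data follows; joining on dspk.DateTimeNum and keeping unmatched rows (all = TRUE) are assumptions, since this page does not state the join type.

```r
# Made-up upper and lower sensor tables with partly overlapping timestamps.
upper <- data.frame(dspk.DateTimeNum = c(0, 900, 1800),
                    dspk.Values = c(120, 130, 125),
                    dspk.StateOfValue = c(80, 80, 80))
lower <- data.frame(dspk.DateTimeNum = c(900, 1800, 2700),
                    dspk.Values = c(60, 66, 61),
                    dspk.StateOfValue = c(80, 80, 80))

# Join on time; same-named columns get the default suffixes .x (upper) and .y (lower).
merged <- merge(upper, lower, by = "dspk.DateTimeNum", all = TRUE)

print(names(merged))
print(merged)
```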

All values that were deleted in either the upper sensor’s dataset or the lower sensor’s dataset must be deleted in both datasets and coded as 94 (default). This must be done because the two datasets are compared to each other to calculate light attenuation coefficient kd.

If a spike was detected in both the upper and lower datasets, restore their original values and code as 95 (default). Spikes were often found in both datasets simultaneously which means it wasn’t sensor error. These may have been passing clouds or plumes of suspended matter which is important data for understanding the light climate.
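The two cross-sensor rules above can be sketched as follows on a made-up merged table. This is an interpretation, not the package's code: it assumes that the sensor where the deletion was first detected keeps its own code (91 or 92) and only the other sensor receives 94.

```r
# Made-up merged table: row 2 spikes in the upper sensor only,
# row 4 in the lower sensor only, row 5 in both.
m <- data.frame(
  orig.values.x = c(100, 900, 110, 105, 800),
  orig.values.y = c(50, 52, 55, 400, 380),
  dspk.Values.x = c(100, NA, 110, 105, NA),
  dspk.Values.y = c(50, 52, 55, NA, NA),
  dspk.StateOfValue.x = c(80, 92, 80, 80, 92),
  dspk.StateOfValue.y = c(80, 80, 80, 92, 92)
)

del.x <- m$dspk.StateOfValue.x %in% c(91, 92)
del.y <- m$dspk.StateOfValue.y %in% c(91, 92)
spike.both <- m$dspk.StateOfValue.x == 92 & m$dspk.StateOfValue.y == 92

# Deleted in one sensor only: delete the partner value and code it 94.
only.x <- del.x & !del.y
only.y <- del.y & !del.x
m$dspk.Values.y[only.x] <- NA
m$dspk.StateOfValue.y[only.x] <- 94
m$dspk.Values.x[only.y] <- NA
m$dspk.StateOfValue.x[only.y] <- 94

# Spike in both sensors: restore the original values and code them 95.
m$dspk.Values.x[spike.both] <- m$orig.values.x[spike.both]
m$dspk.Values.y[spike.both] <- m$orig.values.y[spike.both]
m$dspk.StateOfValue.x[spike.both] <- 95
m$dspk.StateOfValue.y[spike.both] <- 95

print(m)
```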

Despiked PPFD data is saved to the subfolder ‘step2Despike’.

PPFD data gaps of at most one hour (default) are linearly interpolated. The state of value codes are not changed. If the state of value code says a value was deleted or missing but a value is present, it can be assumed that the value was interpolated.
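Gap-limited linear interpolation can be sketched in base R with approx(). The helper interpolate.gaps() below is hypothetical and not part of the package; it assumes a gap is measured as the time between the valid samples on either side of it, and that times are in seconds.

```r
# Hypothetical helper: linearly interpolate NA values, but only inside gaps
# whose bracketing valid samples are at most max.gap apart.
interpolate.gaps <- function(time, values, max.gap) {
  ok <- !is.na(values)
  known.t <- time[ok]
  n <- length(known.t)

  # Linear interpolation at every timestamp (NA outside the known range).
  filled <- approx(known.t, values[ok], xout = time, method = "linear")$y

  # Times of the valid samples before and after each timestamp.
  idx <- findInterval(time, known.t)
  prev.t <- ifelse(idx >= 1, known.t[pmax(idx, 1)], NA)
  next.t <- ifelse(idx < n, known.t[pmin(idx + 1, n)], NA)

  # Undo the interpolation where the gap is open-ended or too long.
  too.long <- !ok & (is.na(prev.t) | is.na(next.t) | (next.t - prev.t) > max.gap)
  filled[too.long] <- NA
  filled
}

time <- seq(0, 6300, by = 900)                 # 15-minute sampling, in seconds
values <- c(10, NA, 30, NA, NA, NA, NA, 80)    # one short gap, one long gap
filled <- interpolate.gaps(time, values, max.gap = 3600)
print(filled)
```

With max.gap = 3600 the 30-minute gap is filled and the 75-minute gap is left empty.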

Light attenuation coefficient kd is calculated as kd = (1/Δz) * ln(E1/E2), where Δz is the distance between the sensors in meters (0.4 m by default), E1 is the upper sensor PPFD and E2 is the lower sensor PPFD. It is saved to the column ‘kd’. Units: m^-1.
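As a worked example of this formula, the sketch below computes kd for made-up PPFD readings. The function name kd.calc is introduced here for illustration only.

```r
# kd = (1 / dz) * ln(E1 / E2), with dz the sensor distance in meters.
kd.calc <- function(E1, E2, dz = 0.4) {
  (1 / dz) * log(E1 / E2)   # log() is the natural logarithm in R
}

# Upper sensor reads twice the lower sensor: kd = ln(2) / 0.4, about 1.73 m^-1.
kd.half <- kd.calc(E1 = 100, E2 = 50)
print(kd.half)

# A lower sensor brighter than the upper sensor gives a negative kd.
kd.neg <- kd.calc(E1 = 50, E2 = 100)
print(kd.neg)
```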

Data saved as csv files to the subdirectory ‘step3Interpol.kd’.

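Delete all PPFD data values where kd is negative and give them the state of value code 97 (default). A negative kd would mean more light at the lower sensor than at the upper sensor, and light cannot increase with depth in the water column.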

Copy the column ‘kd’ to the new column 'dspk.kd' and remove all values from 'dspk.kd' outside detection limits 1 PPFD (default) for the upper sensor and 0.25 PPFD (default) for the lower sensor. When the light levels approach zero, it becomes too difficult to accurately measure the difference between the upper and lower sensors.
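The negative kd filter and the detection limit filter can be sketched together on made-up data. Two points are assumptions rather than documented behavior: a kd value is removed when either sensor is below its limit, and dspk.kd is also left empty where the PPFD values were deleted.

```r
# Made-up merged PPFD values (upper = .x, lower = .y).
d <- data.frame(
  dspk.Values.x = c(400, 50, 0.8, 20, 5),
  dspk.Values.y = c(200, 60, 0.5, 0.1, 2),
  dspk.StateOfValue.x = 80,
  dspk.StateOfValue.y = 80
)
Dist.Sensors <- 0.4
kddlupper <- 1      # detection limit, upper sensor
kddllower <- 0.25   # detection limit, lower sensor

d$kd <- (1 / Dist.Sensors) * log(d$dspk.Values.x / d$dspk.Values.y)

# Negative kd: delete both PPFD values and code them 97.
neg <- !is.na(d$kd) & d$kd < 0
d$dspk.Values.x[neg] <- NA
d$dspk.Values.y[neg] <- NA
d$dspk.StateOfValue.x[neg] <- 97
d$dspk.StateOfValue.y[neg] <- 97

# Copy kd and blank it where PPFD is missing or below a detection limit.
d$dspk.kd <- d$kd
below <- is.na(d$dspk.Values.x) | is.na(d$dspk.Values.y) |
  d$dspk.Values.x < kddlupper | d$dspk.Values.y < kddllower
d$dspk.kd[below] <- NA

print(d)
```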

Despike ‘dspk.kd’ and delete those spikes also from the PPFD columns 'dspk.Values.x' and 'dspk.Values.y'. State of value code 96 (default).

Make a unified state of value for kd. Copy ‘dspk.StateOfValue.x’ into the new column ‘dspk.StateOfValue’ and make all 91 and 92 codes the code 94 (these are the default codes).
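The recoding described above amounts to a copy followed by one replacement, as in this sketch on a made-up vector of upper sensor codes.

```r
# Made-up upper sensor state of value codes.
dspk.StateOfValue.x <- c(80, 91, 92, 94, 95, 96, 97, 255)

# Copy, then turn the min/max (91) and despike (92) codes into 94.
dspk.StateOfValue <- dspk.StateOfValue.x
dspk.StateOfValue[dspk.StateOfValue %in% c(91, 92)] <- 94

print(dspk.StateOfValue)
```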

Interpolate PPFD data gaps of maximum one hour (default) again.

Calculate kd light attenuation coefficient again in column ‘dspk.kd’.

Delete all ‘dspk.kd’ values where the PPFD values are outside the detection limit: upper sensor PPFD is less than 1 (default) and lower sensor is less than 0.25 (default).

Save the final dataset in subfolder ‘step4Despikekd.FinalData’

Value

Each pair of input files is output as one csv file. All data is saved to a folder ‘/autoPPFDdespikeYYYYMMDDHHMMSS’ in your working directory. The final data will be in the subfolder /step4Despikekd.FinalData. Upper sensor data has the suffix .x and lower sensor data has the suffix .y. The original PPFD data is in columns orig.values.x and orig.values.y. The despiked PPFD data is in columns dspk.Values.x and dspk.Values.y. The PPFD state of value codes are in columns dspk.StateOfValue.x and dspk.StateOfValue.y. The despiked light attenuation coefficient kd values are in column dspk.kd. The state of value codes for kd are in column dspk.StateOfValue. The datetime in UNIX seconds is in column dspk.DateTimeNum.
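Once a final table has been read back into R, the UNIX seconds can be converted to datetimes and the good kd values selected. The sketch below uses a small made-up table in place of a real output file. The column DateTime is added here for illustration, and treating code 80 as the only accepted kd code is an assumption.

```r
# Made-up stand-in for a table read from step4Despikekd.FinalData.
final <- data.frame(
  dspk.DateTimeNum = c(1609459200, 1609460100, 1609461000),
  dspk.kd = c(1.7, NA, 1.9),
  dspk.StateOfValue = c(80, 96, 80)
)

# Convert UNIX seconds to a datetime in GMT.
final$DateTime <- as.POSIXct(final$dspk.DateTimeNum,
                             origin = "1970-01-01", tz = "GMT")

# Keep rows where kd is present and coded as good data.
good <- final[!is.na(final$dspk.kd) & final$dspk.StateOfValue == 80, ]

print(good)
```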

Examples

#HIC data cleaning and validation protocol
#Batch process HIC database PPFD files that were formatted
#with the HIC.Continuous.Data.Import.Format() function.
HIC.PPFDAutoValidation.CSVfileBatchProcess(
    input.directory = "C:/Rdata/PPFDdata",
    Value = "Value", val.NAvalue = -777, #all values of -777 will be set to NA
    DateTime = "DateTimeUnix",
    max.gap = 900) #900 seconds or 15 minutes maximum gap to interpolate

#Process the r object tables PPFDuppersensor and PPFDlowersensor on the column "Value".
HIC.PPFDAutoValidation.CSVfileBatchProcess(DataUpper = PPFDuppersensor,
    DataLower = PPFDlowersensor, Value = "Value")

pgelsomini/HICbioclean documentation built on Dec. 28, 2021, 5:22 p.m.