The three following case studies are tested in detail within FAKIN (i.e. the proposed best practices are applied to these case studies and cross-checked as to whether their application is useful).
Two types of datasets are handled in the Geosalz project: spreadsheets of mostly hydrochemical laboratory analyses and archived paper files. Accordingly, the rawdata and processing folders are divided into two parts, labor and archive.
Data processing

- Archive data
    - Import into Endnote DB
    - Getting data values out of pictures
- Lab data
    - Manual pre-processing: data cleaning, required due to heterogeneous formats
    - Automated workflow: data import, data aggregation and export with R (see the sketch after this list)
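As an illustration of the automated part of this workflow, the following minimal R sketch imports pre-cleaned lab spreadsheets, aggregates them and exports the result. The file paths and column names (site, parameter, value) are assumptions for illustration, not the project's actual schema:

```r
library(readxl)
library(dplyr)

# read all pre-cleaned spreadsheets (paths and columns are illustrative)
files <- list.files("precleaned-data", pattern = "\\.xlsx$", full.names = TRUE)
lab_data <- bind_rows(lapply(files, read_excel))

# aggregate, e.g. median concentration per sampling site and parameter
lab_summary <- lab_data %>%
  group_by(site, parameter) %>%
  summarise(median_value = median(value, na.rm = TRUE), .groups = "drop")

# export for further use
write.csv(lab_summary, "cleaned-data/lab_summary.csv", row.names = FALSE)
```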
Folder structure

```
//server/rawdata
    geosalz
        BWB_archiv
        BWB_labor
        README.yml
//server/processing
    geosalz
        archiv
        labor
            README.yml
            documents/
            precleaned-data/
            cleaned-data/
            figures/
            <rawdata.lnk>    (link to "//server/rawdata/geosalz/BWB_labor")
            <results.lnk>    (link to "//server/results/geosalz/report/")
//server/results
    geosalz
        admin
        reports
            final_report.docx
            <processing.lnk> (link to "//server/processing/geosalz/labor/cleaned_data/v1.0")
            README.yml
```
Folder names indicate the owner of the data, here BWB. The README.yml gives information on the licensing of the data; in this case, the data is for restricted use only.
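A minimal sketch of writing such licensing information to a README.yml from R; the field names are assumptions, not prescribed by FAKIN:

```r
library(yaml)

# write the licensing information to README.yml (field names are assumptions)
write_yaml(
  list(
    owner = "BWB",
    license = "restricted use only",
    contents = "hydrochemical lab analyses and archive data"
  ),
  "README.yml"
)
```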
The BWB_labor folder contains the METADATA.yml, which comprises information on the origin of each file. In this case, the data was received by email; thus each email is exported as a txt file (select - export as…) and copied to the METADATA.yml. The METADATA.yml also contains the email text itself, which may provide additional meta information. The METADATA.yml makes clear when and from whom the data was sent and who received it. A hyperlink can be inserted that links directly to the corresponding processing folder. No further subfolders are required.
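A sketch of how such a METADATA.yml entry could be generated from R; all field names and values are hypothetical examples of the information described above (origin, sender, receiver, date, link to the processing folder):

```r
library(yaml)

# one entry per received file; all values are placeholders
metadata <- list(
  list(
    file = "labordaten_2018.xlsx",
    origin = "email",                      # data was received by email
    email_export = "mail_2018-03-01.txt",  # exported email (select - export as...)
    sent_by = "sender@bwb.example",
    received_by = "receiver@kwb.example",
    date = "2018-03-01",
    processing = "//server/processing/geosalz/labor"
  )
)
write_yaml(metadata, "METADATA.yml")
```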
Workflow for creating the above folder structure
1. Define the project acronym geosalz (add it in PROJECTS.yml)
2. Create the initial folder structure on //servername/rawdata/geosalz
3. Create the initial folder structure on //servername/processing/geosalz with one subfolder for each task/work package, i.e. archiv and labor
4. Create a README.yml for each task describing the folder contents (steps 2-4 are sketched in R below)
5. Link relevant results to //server/projekte/AUFTRAEGE/Rahmenvertrag GRW-WV/Data and documents/Versalzung
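Steps 2-4 could be scripted, for example, as follows; the paths are taken from the structure above, while the script itself is only a sketch:

```r
# create the rawdata and processing trees for the project (steps 2-4)
project <- "geosalz"
tasks <- c("archiv", "labor")

dir.create(file.path("//servername/rawdata", project), recursive = TRUE)

for (task in tasks) {
  task_dir <- file.path("//servername/processing", project, task)
  dir.create(task_dir, recursive = TRUE)
  # empty README.yml, to be filled with a description of the folder contents
  file.create(file.path(task_dir, "README.yml"))
}
```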
Lessons learnt:

- Manual data preparation is indispensable for heterogeneous data sets
- Naming conventions for large heterogeneous data sets are not always in compliance with best practices
Challenge:

The LCA modelling software Umberto can produce large raw data output files (> 300 MB CSV files) that are sometimes even too big for EXCEL 2010 (> 1 million rows) but need to be aggregated (e.g. grouped by specific criteria). This aggregation was usually performed manually in EXCEL whenever the model output stayed below EXCEL's 1 million row limit.
Workflow improvement developed within FAKIN:

An open source R package kwb.umberto was programmed to automate:

- importing the Umberto model results,
- aggregating the data according to the user's needs, and finally
- exporting the aggregated results to a results.xlsx EXCEL spreadsheet.
This results.xlsx EXCEL spreadsheet is referenced by another EXCEL spreadsheet, figures.xlsx, which contains the figure templates and just links to the results.xlsx in order to update the predefined figures.
This workflow now replaces the time-consuming and error-prone data aggregation formerly performed manually in EXCEL, whilst still enabling users to adapt the figures to their needs without coding knowledge. A generic sketch of the automated steps follows below.
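The following sketch illustrates the kind of import-aggregate-export pipeline the package automates; it deliberately uses generic tidyverse/openxlsx calls and invented column names (process, material, amount), not kwb.umberto's actual API:

```r
library(readr)
library(dplyr)
library(openxlsx)

# import the Umberto model results (> 300 MB csv)
umberto_raw <- read_csv("umberto_output.csv")

# aggregate according to the user's needs, e.g. totals per process and material
umberto_agg <- umberto_raw %>%
  group_by(process, material) %>%
  summarise(amount = sum(amount), .groups = "drop")

# export to the results.xlsx that figures.xlsx links to
write.xlsx(umberto_agg, "results.xlsx")
```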
Challenges:

The output of (on-line) monitoring technologies is often difficult to interpret and also inconvenient to handle, as the output formats of different devices (in one water treatment scheme) can vary strongly. Furthermore, frequent reporting and documentation of the treatment performance via (on-line) monitoring can be time-consuming for the personnel and requires advanced software solutions. An alternative to commercial (and often expensive) software solutions are tools based on the open source software R `r citep(citation())`. The free software approach allows any R programmer to produce customized tools for each individual end-user.
Thus, an automated reporting tool was developed within the AQUANES project for four different pilot plant sites, enabling an integrative assessment of the different monitoring devices together with water quality data obtained from laboratory analysis, in order to:

- Increase the reliability and reproducibility of handling large amounts of data, by reducing the likelihood of human error in complex systems and by increasing the transparency of the data processing.
- Promote the use of customized R tools for different end-users such as utilities, consultants and other research teams.
Therefore the open source R package aquanes.report `r citep(manual["Rustler_2018"])` was programmed, which is able to:

- import operational and lab data for each pilot site,
- perform temporal aggregation (e.g. 5 min, 1 h, 24 h median values; see the sketch after this list), and
- visualise raw or aggregated data either interactively in a web browser or by creating a standardised report (e.g. monthly) in html, pdf or docx format.
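A minimal sketch of such a temporal median aggregation in dplyr style; the column names timestamp, parameter and value are assumptions, not the package's actual interface:

```r
library(dplyr)

# aggregate raw monitoring data to median values per time interval,
# e.g. "5 mins", "1 hour" or "24 hours"; timestamp is assumed to be POSIXct
aggregate_median <- function(data, interval = "1 hour") {
  data %>%
    mutate(interval_start = cut(timestamp, breaks = interval)) %>%
    group_by(parameter, interval_start) %>%
    summarise(median_value = median(value, na.rm = TRUE), .groups = "drop")
}
```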
For the four different pilot plant sites, the data (operational and lab data) to be imported into the R tool came from various sources at different temporal resolutions, as detailed below:
Haridwar: operational data are stored by Autarcon in a MySQL database (temporal resolution: ~ 2-3 minutes, i.e. ~ 0.7 million data points per month), which is accessible from the web and could thus be easily imported into R (see the import sketch below). Lab data were initially provided by Autarcon in an unstructured format that could not be imported into R automatically. However, after agreeing on a standardised EXCEL spreadsheet format (e.g. one spreadsheet per site, one sheet per parameter and additional sheets providing metadata for parameters and sites), it was possible to integrate the lab data into the R tool.
Basel Lange-Erlen: operational data are provided by the water supplier in EXCEL spreadsheets on a weekly basis for each site (i.e. "Rein" and "Wiese") with a temporal resolution of 5 minutes (i.e. ~ 0.5 million data points per month). Lab data are provided by the water supplier in a single comma-separated CSV file, which is exported from a database. As the structure of the lab data was thus standardised, they could be easily imported into the R tool.
Berlin-Schönerlinde: operational data from the WEDECO pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes whenever it is updated by BWB, making automated import with R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.

Berlin-Tiefwerder: operational data from the PENTAIR pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes whenever it is updated by BWB, making automated import with R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.
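For a web-accessible MySQL database as in Haridwar, the import into R could look roughly as follows; host, credentials, table and column names are placeholders, not Autarcon's actual setup:

```r
library(DBI)
library(RMariaDB)

# connect to the web-accessible MySQL database (placeholder credentials)
con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "db.example.org",
  dbname = "haridwar",
  user = "reader",
  password = Sys.getenv("DB_PASSWORD")
)

# fetch the operational data (~ 2-3 minute resolution)
operation <- dbGetQuery(con, "SELECT timestamp, parameter, value FROM operation")
dbDisconnect(con)
```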
```{block2, type="rmdwarning"}
The high temporal resolution (~ seconds) of the operational data for both
Berlin pilot plants resulted in large data amounts (~ 10 million data points
per month), which required a large effort to optimise the performance of the
R reporting tool in order to enable the visualisation of the pilot plants' raw
data for the test operation period (~ 18 months) on computers with limited RAM
resources (~ 8-12 GB).
```
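One common way to handle such data volumes on machines with 8-12 GB of RAM is data.table; this is only an assumed illustration of the kind of optimisation involved, not necessarily what the tool does internally:

```r
library(data.table)

# fread is fast and memory-efficient for large csv files (~ 10 million rows);
# the file and column names (timestamp, parameter, value) are placeholders
raw <- fread("scada_export.csv")

# aggregate to hourly medians instead of keeping all raw points in memory
raw[, timestamp := as.POSIXct(timestamp)]
hourly <- raw[, .(median_value = median(value)),
              by = .(parameter, hour = cut(timestamp, "1 hour"))]
```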
The R tool is used regularly by KWB (for the sites Berlin-Schönerlinde and Berlin-Tiefwerder) for assessing the pilot plants' operational performance interactively. In addition, for advanced assessments, only the data import and aggregation routines are used, combined with R scripts developed by KWB students.
For the other two pilot plant sites, Haridwar and Basel Lange-Erlen, the AQUANES project partners use the automated R reporting tool in a similar way.