The three following case studies are tested in detail within FAKIN (i.e. the proposed best practices are applied to these case studies and cross-checked as to whether their application is useful).
Two types of datasets are handled in the Geosalz project: spreadsheets of mostly hydrochemical laboratory analyses and archived paper files. Accordingly, the rawdata and processing folders are divided into two parts, labor and archive.
Data processing

- Archive data
    - Import into Endnote DB
    - Getting data values out of pictures
- Lab data
    - Manual pre-processing: data cleaning, required due to heterogeneous formats
    - Automated workflow: data import, data aggregation and export with R (see the sketch after this list)
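As an illustration of the automated part of this workflow, the following minimal R sketch imports pre-cleaned lab spreadsheets, aggregates them and exports the result. The file paths and column names (site, parameter, value) are assumptions for illustration, not the project's actual schema:

```r
library(readxl)
library(dplyr)

# read all pre-cleaned spreadsheets (paths and columns are illustrative)
files <- list.files("precleaned-data", pattern = "\\.xlsx$", full.names = TRUE)
lab_data <- bind_rows(lapply(files, read_excel))

# aggregate, e.g. median concentration per sampling site and parameter
lab_summary <- lab_data %>%
  group_by(site, parameter) %>%
  summarise(median_value = median(value, na.rm = TRUE), .groups = "drop")

# export for further use
write.csv(lab_summary, "cleaned-data/lab_summary.csv", row.names = FALSE)
```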
Folder structure

```
//server/rawdata
    geosalz
        BWB_archiv
        BWB_labor
        README.yml
//server/processing
    geosalz
        archiv
        labor
            README.yml
            documents/
            precleaned-data/
            cleaned-data/
            figures/
            <rawdata.lnk>    (link to "//server/rawdata/geosalz/BWB_labor")
            <results.lnk>    (link to "//server/results/geosalz/report/")
//server/results
    geosalz
        admin
        reports
            final_report.docx
            <processing.lnk> (link to "//server/processing/geosalz/labor/cleaned_data/v1.0")
            README.yml
```
Folder names indicate the owner of the data, here BWB. The README.yml gives information on the licensing of the data; in this case, the data is for restricted use only.
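A minimal sketch of writing such licensing information to a README.yml from R; the field names are assumptions, not prescribed by FAKIN:

```r
library(yaml)

# write the licensing information to README.yml (field names are assumptions)
write_yaml(
  list(
    owner = "BWB",
    license = "restricted use only",
    contents = "hydrochemical lab analyses and archive data"
  ),
  "README.yml"
)
```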
The BWB_labor folder contains the METADATA.yml, which comprises information on the origin of each file. In this case, the data was received by email; thus each email is exported as a txt file (select - export as…) and copied to the METADATA.yml. The METADATA.yml also contains the email text itself, which may provide additional meta information. The METADATA.yml makes clear when and from whom the data was sent and who received it. A hyperlink can be inserted that links directly to the corresponding processing folder. No further subfolders are required.
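A sketch of how such a METADATA.yml entry could be generated from R; all field names and values are hypothetical examples of the information described above (origin, sender, receiver, date, link to the processing folder):

```r
library(yaml)

# one entry per received file; all values are placeholders
metadata <- list(
  list(
    file = "labordaten_2018.xlsx",
    origin = "email",                      # data was received by email
    email_export = "mail_2018-03-01.txt",  # exported email (select - export as...)
    sent_by = "sender@bwb.example",
    received_by = "receiver@kwb.example",
    date = "2018-03-01",
    processing = "//server/processing/geosalz/labor"
  )
)
write_yaml(metadata, "METADATA.yml")
```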
Workflow for creating the above folder structure
1. Define the project acronym geosalz (add it in PROJECTS.yml)
2. Create the initial folder structure on //servername/rawdata/geosalz
3. Create the initial folder structure on //servername/processing/geosalz with one subfolder for each task/work package, i.e. archiv and labor
4. Create a README.yml for each task describing the folder contents (steps 2-4 are sketched in R below)
5. Link relevant results to //server/projekte/AUFTRAEGE/Rahmenvertrag GRW-WV/Data and documents/Versalzung
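Steps 2-4 could be scripted, for example, as follows; the paths are taken from the structure above, while the script itself is only a sketch:

```r
# create the rawdata and processing trees for the project (steps 2-4)
project <- "geosalz"
tasks <- c("archiv", "labor")

dir.create(file.path("//servername/rawdata", project), recursive = TRUE)

for (task in tasks) {
  task_dir <- file.path("//servername/processing", project, task)
  dir.create(task_dir, recursive = TRUE)
  # empty README.yml, to be filled with a description of the folder contents
  file.create(file.path(task_dir, "README.yml"))
}
```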
Lessons learnt:

- Manual data preparation is indispensable for heterogeneous data sets
- Naming conventions for large heterogeneous data sets are not always in compliance with best practices
Challenge:

The LCA modelling software Umberto can produce large raw data output files (> 300 MB CSV files) that are sometimes even too big for EXCEL 2010 (> 1 million rows) but need to be aggregated (e.g. grouped by specific criteria). This aggregation was usually performed manually in EXCEL whenever the model output stayed below EXCEL's 1 million row limit.
Workflow improvement developed within FAKIN:

An open source R package kwb.umberto was programmed to automate:

- importing the Umberto model results,
- aggregating the data according to the user's needs, and finally
- exporting the aggregated results to a results.xlsx EXCEL spreadsheet.
This results.xlsx EXCEL spreadsheet is referenced by another EXCEL spreadsheet, figures.xlsx, which contains the figure templates and just links to the results.xlsx in order to update the predefined figures.
This workflow now replaces the time-consuming and error-prone data aggregation formerly performed manually in EXCEL, whilst still enabling users to adapt the figures to their needs without coding knowledge. A generic sketch of the automated steps follows below.
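The following sketch illustrates the kind of import-aggregate-export pipeline the package automates; it deliberately uses generic tidyverse/openxlsx calls and invented column names (process, material, amount), not kwb.umberto's actual API:

```r
library(readr)
library(dplyr)
library(openxlsx)

# import the Umberto model results (> 300 MB csv)
umberto_raw <- read_csv("umberto_output.csv")

# aggregate according to the user's needs, e.g. totals per process and material
umberto_agg <- umberto_raw %>%
  group_by(process, material) %>%
  summarise(amount = sum(amount), .groups = "drop")

# export to the results.xlsx that figures.xlsx links to
write.xlsx(umberto_agg, "results.xlsx")
```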
Challenges:

The output of (on-line) monitoring technologies is often difficult to interpret and also inconvenient to handle, as the output formats of different devices (in one water treatment scheme) can vary strongly. Furthermore, frequent reporting and documentation of the treatment performance via (on-line) monitoring can be time-consuming for the personnel and requires advanced software solutions. An alternative to commercial (and often expensive) software solutions are tools based on the open source software R `r citep(citation())`. The free software approach allows any R programmer to produce customized tools for each individual end-user.
Thus, an automated reporting tool was developed within the AQUANES project for four different pilot plant sites, enabling an integrative assessment of the different monitoring devices together with water quality data obtained from laboratory analysis, in order to:

- Increase the reliability and reproducibility of handling large amounts of data, by reducing the likelihood of human error in complex systems and by increasing the transparency of the data processing.
- Promote the use of customized R tools for different end-users such as utilities, consultants and other research teams.
Therefore the open source R package aquanes.report `r citep(manual["Rustler_2018"])` was programmed, which is able to:

- import operational and lab data for each pilot site,
- perform temporal aggregation (e.g. 5 min, 1 h, 24 h median values; see the sketch after this list), and
- visualise raw or aggregated data either interactively in a web browser or by creating a standardised report (e.g. monthly) in html, pdf or docx format.
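A minimal sketch of such a temporal median aggregation in dplyr style; the column names timestamp, parameter and value are assumptions, not the package's actual interface:

```r
library(dplyr)

# aggregate raw monitoring data to median values per time interval,
# e.g. "5 mins", "1 hour" or "24 hours"; timestamp is assumed to be POSIXct
aggregate_median <- function(data, interval = "1 hour") {
  data %>%
    mutate(interval_start = cut(timestamp, breaks = interval)) %>%
    group_by(parameter, interval_start) %>%
    summarise(median_value = median(value, na.rm = TRUE), .groups = "drop")
}
```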
For the four different pilot plant sites, the data (operational and lab data) to be imported into the R tool came from various sources at different temporal resolutions, as detailed below:
Haridwar: operational data are stored by Autarcon in a MySQL database (temporal resolution: ~ 2-3 minutes, i.e. ~ 0.7 million data points per month), which is accessible from the web and could thus be easily imported into R (see the import sketch below). Lab data were initially provided by Autarcon in an unstructured format that could not be imported into R automatically. However, after agreeing on a standardised EXCEL spreadsheet format (e.g. one spreadsheet per site, one sheet per parameter and additional sheets providing metadata for parameters and sites), it was possible to integrate the lab data into the R tool.
Basel Lange-Erlen: operational data are provided by the water supplier in EXCEL spreadsheets on a weekly basis for each site (i.e. "Rein" and "Wiese") with a temporal resolution of 5 minutes (i.e. ~ 0.5 million data points per month). Lab data are provided by the water supplier in a single comma-separated CSV file, which is exported from a database. As the structure of the lab data was thus standardised, they could be easily imported into the R tool.
Berlin-Schönerlinde: operational data from the WEDECO pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes whenever it is updated by BWB, making automated import with R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.

Berlin-Tiefwerder: operational data from the PENTAIR pilot plant are collected using a SCADA system (temporal resolution: ~ seconds, i.e. ~ 10 million data points per month). Lab data are provided by BWB in a single EXCEL spreadsheet. However, its structure often changes whenever it is updated by BWB, making automated import with R impossible without adapting the import functions. Thus lab data were not integrated into the R tool for this site.
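For a web-accessible MySQL database as in Haridwar, the import into R could look roughly as follows; host, credentials, table and column names are placeholders, not Autarcon's actual setup:

```r
library(DBI)
library(RMariaDB)

# connect to the web-accessible MySQL database (placeholder credentials)
con <- dbConnect(
  RMariaDB::MariaDB(),
  host = "db.example.org",
  dbname = "haridwar",
  user = "reader",
  password = Sys.getenv("DB_PASSWORD")
)

# fetch the operational data (~ 2-3 minute resolution)
operation <- dbGetQuery(con, "SELECT timestamp, parameter, value FROM operation")
dbDisconnect(con)
```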
```{block2, type="rmdwarning"}
The high temporal resolution (~ seconds) of the operational data for both
Berlin pilot plants resulted in large data amounts (~ 10 million data points
per month), which required a large effort to optimise the performance of the
R reporting tool in order to enable the visualisation of the pilot plants' raw
data for the test operation period (~ 18 months) on computers with limited RAM
resources (~ 8-12 GB).
```
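One common way to handle such data volumes on machines with 8-12 GB of RAM is data.table; this is only an assumed illustration of the kind of optimisation involved, not necessarily what the tool does internally:

```r
library(data.table)

# fread is fast and memory-efficient for large csv files (~ 10 million rows);
# the file and column names (timestamp, parameter, value) are placeholders
raw <- fread("scada_export.csv")

# aggregate to hourly medians instead of keeping all raw points in memory
raw[, timestamp := as.POSIXct(timestamp)]
hourly <- raw[, .(median_value = median(value)),
              by = .(parameter, hour = cut(timestamp, "1 hour"))]
```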
The R tool is used regularly by KWB (for the sites Berlin-Schönerlinde and Berlin-Tiefwerder) for assessing the pilot plants' operational performance interactively. In addition, for advanced assessments, only the data import and aggregation routines are used, combined with R scripts developed by KWB students.
For the other two pilot plant sites, Haridwar and Basel Lange-Erlen, the AQUANES project partners use the automated R reporting tool in a similar way.