In KWB-R/fakin.doc: Does not matter.

# Metadata {#metadata}

Metadata are data about data. It is up to us to define

- what metadata (about raw data, processed data, produced plots, documents, reports, etc.) we want to store,
- where to store metadata and
- in what format to store metadata.

We plan to specify the requirements in more detail when dealing with the test projects. Then, we will also check metadata standards.

Metadata about raw data should always be stored.

Metadata about processed data are also important. However, in case of automated processing with a script, it may be possible to deduce the content of a generated file from the content of the script.

## Why Metadata? {#metadata-why}

Why should we store metadata about data? Metadata are required to

- interpret raw data, e.g. "Why are the oxygen values so high? Ah, I see, someone was there to clean the sensor!",
- gain an overview of available data,
- know what we are allowed to do with the data.

## What Metadata to Store? {#metadata-what}

### General {-}

What information would someone need to find/re-use your data? E.g.

- Location,
- Title,
- Creator name,
- Description,
- Date collected?

### Metadata about Raw Data {-}

What is the most important information about the raw data that we receive?

- Obtained from whom, when, via whom and via which medium, e.g.
    - e-mail from A to B on 2018-01-25 or
    - USB stick handed over personally from C to D on 2018-01-26
- Restrictions of usage, e.g.
    - only for project x or
    - only within KWB or
    - must not be published! or
    - should be published!!
- Description of content and format
    - Where were the measurements taken?
    - What methods were used (to take samples, to analyse parameters in the laboratory)?
    - What devices were used?
    - What do the columns of the table (in the database, XLS/CSV file) mean?
    - In what units are the values given?

### Metadata about Processed Data {-}

What is the most important information about the data that we produce?

- Who created the file? If the file was created by a script, what script created the file and who ran the script?
- When was the file created?
- What was the input data (e.g. raw data or preprocessed data)?
- Which methods were applied to generate the output from the input?
- What was the environment, what were the boundary conditions, e.g.
    - versions of software,
    - versions of R packages?
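
The questions above can be answered automatically at the moment a script writes its output. The following is a minimal sketch in Python, using the standard library only. The function name `collect_provenance`, the field names and the example file names are hypothetical and not part of any KWB convention. For R scripts the same idea applies, with the versions of R and of the R packages taking the place of the Python version.

```python
import os
import platform
from datetime import datetime, timezone


def collect_provenance(script, input_files, method, user=None):
    """Gather metadata about a file that a script is about to create.

    All field names are illustrative assumptions.
    """
    if user is None:
        # Fall back to the login name found in the environment, if any
        user = os.environ.get("USER") or os.environ.get("USERNAME") or "unknown"
    return {
        "created-by": user,                # who ran the script
        "script": script,                  # what script created the file
        "created-at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "input": list(input_files),        # raw or preprocessed input data
        "method": method,                  # how output was derived from input
        "environment": {
            "python": platform.python_version(),
            "platform": platform.platform(),
        },
    }


record = collect_provenance(
    script="clean_rain_data.py",           # hypothetical script name
    input_files=["rain_raw.csv"],          # hypothetical input file
    method="removed duplicates, aggregated to 5-minute sums",
)
for key, value in record.items():
    print(f"{key}: {value}")
```

The creation time is stored in UTC with an explicit offset so that it stays unambiguous when the file is moved between systems.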

### Metadata about Programming Scripts {-}

- What does the script do?
- Who wrote the script?
- How to use the script? Give a short tutorial.

Regarding R programming, we should consider providing scripts in the form of R packages. The R packaging system defines a framework for answering all of the above questions. See how we did this (not yet always with a tutorial) with our packages on GitHub.

## Where to Store Metadata? {#metadata-where}

The two main options for storing metadata are:

- together with the data or close to the data, i.e. in the same folder in which the data file resides,
- in a central file or database.

Unless we have a professional solution (i.e. software for metadata management), we should prefer the first approach, which is simpler and more flexible than the second one.

## In What Format to Store Metadata? {#metadata-format}

Metadata should be stored in a simple plain (i.e. unformatted) text format. This format can be read and written by any text editor on different operating systems and does not require any specific software. Most importantly, it is human-readable.

In the simplest form, metadata can be stored in a plain text file README.txt. As stated above, this file should reside in the same folder as the files that contain the data to be described.
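
Writing such a file can be made part of the step that stores the data. The following Python sketch uses only the standard library and places a `README.txt` next to a data file. The function name `write_readme`, the chosen fields and the example values are assumptions for illustration.

```python
import tempfile
from datetime import date
from pathlib import Path


def write_readme(data_file, description, source, restrictions):
    """Write a README.txt into the folder that contains data_file."""
    data_file = Path(data_file)
    readme = data_file.parent / "README.txt"
    lines = [
        f"File: {data_file.name}",
        f"Description: {description}",
        f"Obtained from: {source}",
        f"Restrictions: {restrictions}",
        f"Documented on: {date.today().isoformat()}",
    ]
    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme


# Demonstration in a temporary folder (hypothetical data file)
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "rain.csv"
    data_file.write_text("time,value\n", encoding="utf-8")
    readme = write_readme(
        data_file,
        description="Rain data, five-minute values",
        source="E-mail from A to B on 2018-01-25",
        restrictions="only within KWB",
    )
    readme_text = readme.read_text(encoding="utf-8")

print(readme_text)
```

Note that this sketch overwrites an existing `README.txt`; a folder with several data files would need the entries to be appended instead.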

Please keep in mind: It is better to write something than to write nothing. It does not have to be perfect. Just do the best you can at the moment you are storing the data. Try to write the metadata directly after storing the data; do not wait until you "feel in the mood" to do so. Anyone who later works or plans to work with the data will be grateful to find some information on it.

Better than an unstructured text file is a structured one. YAML is such a structured format, and we want to use it. It seems to have become established in the scientific world. The advantage of a structured format is that it can be read automatically. We aim to collect all available metadata by automatically browsing for YAML files, reading their content and creating overviews of the available files and data.
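
The browsing step can be sketched as follows. This is a minimal Python example using only the standard library. The function name `find_metadata_files`, the file name `METADATA.yml` and the folder layout created for the demonstration are assumptions. The sketch only locates the files; reading their content would require a YAML library, which is left out here.

```python
import tempfile
from pathlib import Path


def find_metadata_files(root):
    """Return all *.yml and *.yaml files below root, grouped by folder."""
    found = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in {".yml", ".yaml"}:
            found.setdefault(path.parent, []).append(path.name)
    return found


# Demonstration on a temporary folder tree (hypothetical layout)
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "ogre" / "raw").mkdir(parents=True)
    (root / "ogre" / "raw" / "METADATA.yml").write_text("source: bwb\n")
    (root / "ogre" / "raw" / "rain.csv").write_text("time,value\n")
    (root / "PROJECTS.yml").write_text("ogre:\n  department: suw\n")

    # Report folders relative to the root, with forward slashes
    overview = {
        folder.relative_to(root).as_posix(): names
        for folder, names in find_metadata_files(root).items()
    }

print(overview)
```

Grouping by folder matches the idea of keeping metadata next to the data: each folder in the overview is one that has been documented.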

Once you have written a YAML file you can check the validity of the format with this online validator: https://yamlchecker.com/

## Metadata Management Tools

Tools for metadata tracking and data standards are:

- metadata editors, e.g. the online editor of GFZ Potsdam

## Metadata Standards {#metadata-standards}

We want to use a metadata standard.

Examples of metadata standards are (the institutions that publish their data using the standard are listed in brackets):

```{block2, type = 'rmdnote'}

Best-practices roadmap:

- Check metadata standards, e.g. DataCite (see also: ZALF, GFZ Potsdam)
- Define minimum metadata requirements at KWB for raw and processed data.
projects.

The 'best practices for metadata' will be developed for the test projects that are assessed within the FAKIN project.
```

We propose to define some special files that contain metadata related to files and folders. To indicate that these files have a special meaning, the file names are all uppercase.

<!-- ***copied from `10_old_german_chapters.Rmd` ***: -->

<!-- Example: -->

<!-- When we receive time series from Berliner Wasserbetriebe, they have -->
<!-- either one column of timestamps or two columns of timestamps, where -->
<!-- the first contains the start and the second the end of the period to -->
<!-- which the measured values in the corresponding row refer. -->

<!-- If only one column contains timestamps, it is not clear whether the -->
<!-- timestamp marks the start or the end (or even some other point in time -->
<!-- within) of the period. -->

<!-- For a time series with five-minute measurements, e.g. -->

<!-- ``` -->
<!-- Time                Value -->
<!-- 2018-06-05 16:20:00    0 -->
<!-- 2018-06-05 16:25:00    1 -->
<!-- 2018-06-05 16:30:00    2 -->
<!-- 2018-06-05 16:35:00    3 -->
<!-- 2018-06-05 16:40:00    4 -->
<!-- ``` -->

<!-- it is not clear whether the value 1 refers to the period `16:20 - 16:25` or to -->
<!-- the period `16:25 - 16:30`. -->

<!-- This is an example of a very specific metadata requirement. At this -->
<!-- point we usually rely on assumptions, or has any of you ever received -->
<!-- a statement about this from a BWB employee? -->

<!-- Knowing the reference point of the timestamp is important, for example, when -->

<!-- * [Creating rain input files for the software InfoWorks](#input-infoworks-regen) -->

<!-- In the example time series above it is also not clear which time zone -->
<!-- the timestamps refer to. The time zone, or the UTC offset, is an -->
<!-- important piece of metadata. Ideally, all timestamps would contain this -->
<!-- information, e.g. -->

<!-- ``` -->
<!-- 2018-06-05 16:20:00+02 -->
<!-- 2018-06-05 16:25:00+02 -->
<!-- ``` -->

<!-- Then this information would not have to be kept separately. -->

<!-- Metadata are often collected only when needed, namely exactly when someone asks -->
<!-- e.g. the above questions about timestamp reference and time zone. Then an -->
<!-- enquiry is made and, in the best case, an answer is given, e.g. in the form -->
<!-- of an e-mail. The metadata are then contained in the e-mail. -->

<!-- Questions: -->

<!-- - Where and under what name do we store this e-mail? -->

<!-- Example of metadata from my personal log file: -->

<!-- > Rain data read; approx. 10 pulses around 14:00; Dannecker scrubbed, -->
<!-- > new rechargeable battery: battery defective, left old battery in, -->
<!-- > reading out data not possible; H=55cm -->

<!-- We have to get better at this somehow, but how? -->

<!-- The basic questions are: -->

<!-- - Who is responsible for storing metadata? -->
<!-- - Where are the metadata stored? Directly with the data or in a -->
<!-- higher-level file/database? -->
<!-- - In what format are the metadata stored? Simple text format vs. -->
<!-- XML format, which can probably only be edited easily with the help of a program. -->


### Special Metadata Files

As stated earlier, we want to use consistent, unique identifiers to indicate
which project or data owner the data belong to. We propose to define the
identifiers in special metadata files.

#### Metadata File PROJECTS.yml and related files {#file-projects}

The project identifiers are defined in a simple YAML file `PROJECTS.yml` in
the `//server/projects$` folder. Only the identifiers defined in this file are
expected to appear as top-level folder names in the
[project folder structure](#project-folder-structure) within this network drive.

Possible content of `PROJECTS.yml`:

```yaml
flusshygiene:
  department: suw
  short-name: Flusshygiene
  long-name: >
    Hygienically relevant microorganisms and pathogens in multifunctional water
    bodies and hydrologic circles - Sustainable management of different types of
    water bodies in Germany
  financing:
    funder: bmbf
    sponsor: bwb
ogre:
  department: suw
  short-name: OGRE
  long-name: >
    Relevance of trace organic substances in stormwater runoff of Berlin
  financing:
    funder: uep-2
    sponsor: Veolia
optiwells-2:
  department: grw
  short-name: OPTIWELLS 2
  long-name:
  type: sponsored
reliable-sewer:
  department: suw
  short-name: RELIABLE_SEWER
  long-name:
  type: contracted
smartplant:
  department: wwt
  short-name: Smartplant
  type: sponsored
...
```

In the file `PROJECTS.yml` each entry represents a project. The key of an entry
is the project identifier. The identifiers appear in alphabetical order.
Each entry should at least contain information on the department, a short/long
name or title of the project and the type of project (funded, sponsored,
contracted). Additional information, such as the year in which the project started, could be given.
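
The rule that only identifiers defined in `PROJECTS.yml` may appear as top-level folder names can be checked automatically. The sketch below, in Python with the standard library only, assumes that the identifiers have already been read from the file into a set. The function name `check_project_folders` and the folder names used in the demonstration are hypothetical.

```python
import tempfile
from pathlib import Path


def check_project_folders(projects_root, known_ids):
    """Compare top-level folder names with the known project identifiers.

    Returns (unknown, missing): folders without an entry in PROJECTS.yml
    and identifiers without a folder.
    """
    folders = {p.name for p in Path(projects_root).iterdir() if p.is_dir()}
    unknown = sorted(folders - set(known_ids))
    missing = sorted(set(known_ids) - folders)
    return unknown, missing


# Identifiers as they might be read from PROJECTS.yml
known_ids = {"flusshygiene", "ogre", "optiwells-2"}

# Demonstration on a temporary folder instead of the network drive
with tempfile.TemporaryDirectory() as tmp:
    for name in ["flusshygiene", "ogre", "old_stuff"]:
        (Path(tmp) / name).mkdir()
    unknown, missing = check_project_folders(tmp, known_ids)

print("Folders without project entry:", unknown)
print("Projects without folder:", missing)
```

Reporting both directions is a design choice: a folder without an entry points to an undocumented project, while an entry without a folder may simply be a project that has not started yet.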

```{block2 type = "rmdnote"}
It could also be useful to define three-letter codes for projects. These codes
could e.g. be used in tables or diagrams in which different projects are
compared and in which space may be limited.
```

The department acronyms could be defined in a file `DEPARTMENTS.yml`:

```yaml
grw:
  short-name: Groundwater
  head: Hella Schwarzmüller
suw:
  short-name: Urban Systems
  head: Pascale Rouault
wwt:
  short-name: Process Innovation
  head: Ulf Miehe
```

The funder acronyms could be defined in a file `FUNDERS.yml`:

```yaml
bmbf: German Federal Ministry of Education and Research (BMBF)
uep-2: >
  Umweltentlastungsprogramm des Landes Berlin, co-financed by the European Union
  (UEP II)
```

#### Metadata File ORGANISATIONS.txt {#file-organisations}

The origin or owner of data is an important piece of metadata. We therefore define unique identifiers for the owners of the data that we use. The acronyms are defined in a special file `ORGANISATIONS.txt`. Possible content of this file:

```yaml
bwb: Berliner Wasserbetriebe
kwb: Kompetenzzentrum Wasser Berlin
uba: Umweltbundesamt
```
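
A flat file of this kind is easy to read with a program. The following Python sketch parses simple `key: value` lines with the standard library only. It is not a YAML parser and would not handle nested or multi-line entries such as those in `PROJECTS.yml`. The function name `read_acronyms` is hypothetical.

```python
def read_acronyms(text):
    """Parse flat 'acronym: name' lines into a dictionary.

    Blank lines and lines starting with '#' are skipped.
    """
    acronyms = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so names may contain colons
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"Not a 'key: value' line: {line!r}")
        acronyms[key.strip()] = value.strip()
    return acronyms


content = """\
bwb: Berliner Wasserbetriebe
kwb: Kompetenzzentrum Wasser Berlin
uba: Umweltbundesamt
"""

organisations = read_acronyms(content)
print(organisations["bwb"])
```

With the acronyms in a dictionary, an identifier found in a folder or file name can be resolved to the full name of the data owner.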

KWB-R/fakin.doc documentation built on Sept. 27, 2019, 9:53 p.m.
