Metadata {#metadata}

Metadata are data about data. It is up to us to define

We plan to specify the requirements in more detail when dealing with the test projects. At that point, we will also check metadata standards.

Metadata about raw data should always be stored.

Metadata about processed data are also important. However, if the data were processed automatically by a script, it may be possible to deduce the content of a generated file from the content of the script.

Why Metadata? {#metadata-why}

Why should we store metadata about data? Metadata are required to

What Metadata to Store? {#metadata-what}

General {-}

What information would someone need to find/re-use your data? E.g.

Metadata about Raw Data {-}

What is the most important information about the raw data that we receive?

Metadata about Processed Data {-}

What is the most important information about the data that we produce?

Metadata about Programming Scripts {-}

Regarding R programming, we should consider providing scripts in the form of R packages. The R packaging system defines a framework for answering all of the above questions. See how we did this (not yet always with a tutorial) with our packages on GitHub.

Where to Store Metadata? {#metadata-where}

The two main options for storing metadata are:

  1. together with the data or close to the data, i.e. in the same folder in which the data file resides,

  2. in a central file or database.

Unless we have a professional solution (i.e. software for metadata management), we should prefer the first approach, which is simpler and more flexible than the second one.

In What Format to Store Metadata? {#metadata-format}

Metadata should be stored in a simple plain (i.e. not formatted) text format. This format can be read and written by any text editor on different operating systems and does not require any specific software. Most importantly, it is human-readable.

In the simplest form, metadata can be stored in a plain text file README.txt. As stated above, this file should reside in the same folder as the files that contain the data to be described.
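The convention of keeping a README.txt next to the data can be checked automatically. The following is a minimal sketch in Python (standard library only), not an established tool: the function name `folders_missing_readme` and the list of file extensions treated as "data files" are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: these extensions mark "data files"; adjust to your own data.
DATA_EXTENSIONS = {".csv", ".txt", ".xlsx"}
README_NAME = "README.txt"


def folders_missing_readme(root):
    """Return sorted folders below root that hold data files but no README.txt."""
    missing = set()
    for path in Path(root).rglob("*"):
        # Skip directories and the README files themselves.
        if not path.is_file() or path.name == README_NAME:
            continue
        if path.suffix.lower() in DATA_EXTENSIONS:
            if not (path.parent / README_NAME).is_file():
                missing.add(path.parent)
    return sorted(missing)
```

Calling `folders_missing_readme("some/data/folder")` returns the folders that still need a description, which could serve as a simple to-do list.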

Please keep in mind: It is better to write something than to write nothing. It does not have to be perfect. Just do the best you can at the moment that you are storing the data. Try to write the metadata directly after storing the data; do not wait until you "feel in the mood" to do so. Anyone who later works with the data, or plans to, will be grateful to find some information about it.

A structured text file is better than an unstructured one. The so-called YAML format is such a structured format, and we want to use it as our standard. It seems to have become established in the scientific community. The advantage of a structured format is that reading it can be automated. We aim to collect all available metadata by automatically searching for YAML files, reading their content and creating overviews of the available files and data.
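The first step of such an automated collection, finding the YAML files, could look roughly like the sketch below. It uses only the Python standard library and covers discovery only. Reading the content of the files would additionally require a YAML parser (for example `yaml.safe_load` from the third-party PyYAML package), which is not used here. The function names are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

YAML_SUFFIXES = {".yml", ".yaml"}


def find_yaml_files(root):
    """Map each folder below root to the sorted names of its YAML files."""
    overview = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in YAML_SUFFIXES:
            overview[path.parent].append(path.name)
    return {folder: sorted(names) for folder, names in overview.items()}


def format_overview(overview):
    """Render the overview as plain text, one folder per block."""
    lines = []
    for folder in sorted(overview):
        lines.append(str(folder))
        lines.extend(f"  - {name}" for name in overview[folder])
    return "\n".join(lines)
```

`print(format_overview(find_yaml_files("some/root/folder")))` then lists every folder that contains metadata files, followed by the names of those files.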

Once you have written a YAML file you can check the validity of the format with this online validator: https://yamlchecker.com/

Metadata Management Tools

Tools for metadata tracking and data standards are:

Metadata Standards {#metadata-standards}

We want to use a metadata standard.

Examples of metadata standards are (the institutions that publish their data using the respective standard are listed in brackets):

```{block2, type = 'rmdnote'}

Best-practices roadmap:

  1. Check metadata standards, e.g. DataCite (see also: ZALF, GFZ Potsdam)

  2. Define minimum metadata requirements at KWB for raw and processed data.
    projects.
```

The 'best practices for metadata' will be developed for the test projects that are assessed within the FAKIN project.

We propose to define some special files that contain metadata related to files and folders. To indicate that these files have a special meaning, the file names are all uppercase.

<!-- ***copied from `10_old_german_chapters.Rmd` ***: -->

<!-- Example: -->

<!-- When we receive time series from Berliner Wasserbetriebe, they have -->
<!-- either one column of timestamps or two columns of timestamps, where -->
<!-- the first contains the start and the second the end of the period to which -->
<!-- the measured values in the corresponding row refer.  -->

<!-- If only one column contains timestamps, it is not clear whether the  -->
<!-- timestamp indicates the start or the end (or even some other point in time  -->
<!-- within) of the period.  -->

<!-- For a time series with five-minute values, e.g. -->

<!-- ``` -->
<!-- Time               Value -->
<!-- 2018-06-05 16:20:00    0 -->
<!-- 2018-06-05 16:25:00    1 -->
<!-- 2018-06-05 16:30:00    2 -->
<!-- 2018-06-05 16:35:00    3 -->
<!-- 2018-06-05 16:40:00    4 -->
<!-- ``` -->

<!-- it is not clear whether the value 1 refers to the period `16:20 - 16:25` or to -->
<!-- the period `16:25 - 16:30`. -->

<!-- This is an example of a very specific metadata requirement. At this -->
<!-- point we mostly rely on assumptions, or has any of you ever received -->
<!-- a statement on this from a BWB employee? -->

<!-- Knowing the reference point of the timestamp is important, for example, when -->

<!-- * [Creating rainfall input files for the InfoWorks software](#input-infoworks-regen) -->

<!-- In the example time series above, it is also unclear which time zone -->
<!-- the timestamps refer to. The time zone or the UTC offset is an -->
<!-- important piece of meta information. Ideally, all timestamps would contain this  -->
<!-- information, e.g. -->

<!-- ``` -->
<!-- 2018-06-05 16:20:00+02 -->
<!-- 2018-06-05 16:25:00+02 -->
<!-- ``` -->

<!-- Then this information would not need to be recorded separately.  -->

<!-- Metadata are often collected only when needed, namely exactly when one asks -->
<!-- e.g. the above questions about timestamp reference and time zone. Then someone -->
<!-- inquires and, in the best case, an answer is given, e.g. in the form of -->
<!-- an e-mail. The metadata are then contained in the e-mail.  -->

<!-- Questions: -->

<!-- - Where and under what name do we file this e-mail? -->

<!-- Example of metadata from my personal log file: -->

<!-- > Rain data read; approx. 10 pulses around 14:00; Dannecker scrubbed,  -->
<!-- > new rechargeable battery: battery defective, left old battery in,  -->
<!-- > reading out data not possible; H=55cm -->

<!-- We have to get better at this somehow, but how? -->

<!-- The fundamental questions are: -->

<!-- - Who is responsible for filing metadata? -->
<!-- - Where are the metadata stored? Directly with the data or in a  -->
<!-- higher-level file/database? -->
<!-- - In which format are the metadata stored? Simple text format vs.  -->
<!-- XML format, which can probably only be edited easily with the help of a program. -->


### Special Metadata Files

As stated earlier, we want to use consistent, unique identifiers to indicate
which project or data owner a piece of data belongs to. We propose to define the
identifiers in special metadata files.

#### Metadata File PROJECTS.yml and related files {#file-projects}

The project identifiers are defined in a simple YAML file `PROJECTS.yml` in
the `//server/projects$` folder. Only the identifiers defined in this file are
expected to appear as top-level folder names in the
[project folder structure](#project-folder-structure) within this network drive.

Possible content of `PROJECTS.yml`:

flusshygiene:
  department: suw
  short-name: Flusshygiene
  long-name: >
    Hygienically relevant microorganisms and pathogens in multifunctional water
    bodies and hydrologic circles - Sustainable management of different types of
    water bodies in Germany
  financing:
    funder: bmbf
    sponsor: bwb
ogre:
  department: suw
  short-name: OGRE
  long-name: >
    Relevance of trace organic substances in stormwater runoff of Berlin
  financing:
    funder: uep-2
    sponsor: Veolia
optiwells-2:
  department: grw
  short-name: OPTIWELLS 2
  long-name:
  type: sponsored
reliable-sewer:
  department: suw
  short-name: RELIABLE_SEWER
  long-name:
  type: contracted
smartplant:
  department: wwt
  short-name: Smartplant
  type: sponsored
...

In the file `PROJECTS.yml`, each entry represents one project and is keyed by
the project identifier. The identifiers appear in alphabetical order.
Each entry should contain at least the department, a short and a long
name or title of the project, and the type of project (funded, sponsored,
contracted). Additional information, such as the year in which the project started, could also be given.
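The rule that only identifiers defined in `PROJECTS.yml` may appear as top-level folder names can be checked with a few lines of code. The sketch below (Python, standard library only) is an illustration under assumptions, not an existing KWB tool: the function name is hypothetical, and the identifiers are passed in as a plain collection. In practice they would be the top-level keys of `PROJECTS.yml`, read with a YAML parser.

```python
from pathlib import Path


def check_project_folders(root, project_ids):
    """Compare top-level folder names in root with the known project identifiers.

    Returns (unknown, unused): folders without a matching identifier and
    identifiers without a matching folder, both sorted alphabetically.
    """
    folders = {p.name for p in Path(root).iterdir() if p.is_dir()}
    known = set(project_ids)
    return sorted(folders - known), sorted(known - folders)
```

Both returned lists are empty when the folder structure and the project list agree. A non-empty first list points to folders that still need an entry in the project file.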

```{block2 type = "rmdnote"}
It could also be useful to define three letter codes for projects. These codes
could e.g. be used in tables or diagrams in which different projects are 
compared and in which space may be limited.
```

The department acronyms could be defined in a file DEPARTMENTS.yml:

grw:
  short-name: Groundwater
  head: Hella Schwarzmüller
suw:
  short-name: Urban Systems
  head: Pascale Rouault 
wwt:
  short-name: Process Innovation
  head: Ulf Miehe

The funder acronyms could be defined in a file FUNDERS.yml:

bmbf: German Federal Ministry of Education and Research (BMBF)
uep-2: >
  Umweltentlastungsprogramm des Landes Berlin, co-financed by the European Union
  (UEP II)

#### Metadata File ORGANISATIONS.txt {#file-organisations}

It is very important to know the origin or owner of data; this is an important piece of metadata. Therefore, we define unique identifiers for the owners of the data that we use. The acronyms are defined in a special file ORGANISATIONS.txt. Possible content of this file:

bwb: Berliner Wasserbetriebe
kwb: Kompetenzzentrum Wasser Berlin
uba: Umweltbundesamt


KWB-R/fakin.doc documentation built on Sept. 27, 2019, 9:53 p.m.