Metadata are data about data. It is up to us to define
what metadata (about raw data, processed data, produced plots, documents, reports, etc.) we want to store,
where to store metadata and
in what format to store metadata.
We plan to specify the requirements in more detail when dealing with the test projects. Then, will also check metadata standards.
Metadata about raw data should always be stored.
Metadata about processed data are also important. However, in case of automated processing with a script, it may be possible to deduce the content of a generated file from the content of the script.
Why should we store metadata about data? Metadata are required to
interpret raw data, e.g. "Why are the oxygen values so high? Ah, I see, someone was there to clean the sensor!",
gain an overview about available data,
know what we are allowed to do with the data.
What information would someone need to find/re-use your data? E.g.
Location,
Title,
Creator name,
Description,
Date collected?
What are the most important information about raw data that we receive?
Obtained from whom, when, via whom and which medium, e.g.
E-Mail from A to B on 2018-01-25
or
USB-Stick given personally from C to D on 2018-01-26
Restriction of usage, e.g.
only for project x
or
only within KWB
or
must not be published!
or
should be published!
!
Description of content and format
Where measurements were taken?
What methods were used (to take samples, to analyse parameters in the laboratory)?
What devices where used?
What do the columns of the table (in the database, XLS/CSV file) mean?
In what units are the values given?
What are the most important information about the data that we produce?
Who created the file? If the file was created by a script, what script created the file and who ran the script?
When was the file created?
What was the input data (e.g. raw data or preprocessed data)?
Which methods were applied to generate the output from the input?
What was the environment, what were boundary conditions, e.g.
versions of software,
versions of R packages?
What does the script do?
Who wrote the script?
How to use the script? Give a short tutorial.
Regarding R programming, we should consider providing scripts in the form of R packages. The R packaging system defines a framework of how to answer all of the above questions. See how we did this (not yet always with with a tutorial) with our packages on GitHub.
The two main options of storing metadata are:
together with the data or near to the data i.e. in the same folder in which the data file resides,
in a central file or database.
Unless we have a professional solution (i.e. software on metadata management) we should prefer the first approach which is simpler and more flexible than the second one.
Metadata should be stored in a simple plain (i.e. not formatted) text format. This format can be read and written by any text editor on different operating systems and does not require any specific software. And most important: it is human readable.
In the simplest form, metadata can be stored in a plain text file README.txt
.
As stated above, this file should reside in the same folder as the files that
contain the data to be described.
Please keep in mind: It is better to write something than to write nothing. It does not have to be perfect. Just try the best yout can at the moment that you are storing the data. Try to write the metadata directly after storing the data, do not wait until you "feel in the mood" to do so. Anyone who is later working or planning to work with the data may be grateful finding some information on it.
Better than writing an unstructured text file is to write a structured text file. The so called YAML format is such a structured format. We want to use this standard. It seems to have established in the scientific world. The advantage of a structured format is that reading of this format can be automated. We aim at collecting all available metadata by automatically browsing for YAML files, reading their content and creating overviews on available files and data.
Once you have written a YAML file you can check the validity of the format with this online validator: https://yamlchecker.com/
Tools for metadata tracking and data standards are:
We want to use a metadata standard.
Examples for metadata standards are (in brackets are listed the institutions who publish their data by using this standard):
```{block2, type = 'rmdnote'}
Best-practices roadmap:
Check meta data standards, e. g. DataCite (see also: ZALF, GFZ Potsdam)
Define minimum metadata requirements at KWB for raw and processed data.
projects.
The 'best-practices for metadata' will be developed for the test projects which are assessed within FAKIN project.
We propose to define some special files that contain metadata related to files and folders. To indicate that these files have a special meaning, the file names are all uppercase.
<!-- ***copied from `10_old_german_chapters.Rmd` ***: --> <!-- Beispiel: --> <!-- Wenn wir Zeitreihen von den Beliner Wasserbetrieben bekommen, dann haben diese --> <!-- entweder eine Spalte mit Zeitstempeln oder zwei Spalten mit Zeitstempeln, wobei --> <!-- die erste den Anfang und die zweite das Ende des Zeitraums enthält, auf den --> <!-- sich die Messwerte in der entsprechenden Zeile beziehen. --> <!-- Wenn nur eine Spalte Zeitstempel enthält, dann ist nicht klar, ob der --> <!-- Zeitstempel den Anfang oder das Ende (oder gar einen anderen Zeitpunkt --> <!-- innerhalb) des Zeitraums angibt. --> <!-- Bei einer Zeitreihe mit fünfminütigen Messwerten, z.B. --> <!-- ``` --> <!-- Zeit Wert --> <!-- 2018-06-05 16:20:00 0 --> <!-- 2018-06-05 16:25:00 1 --> <!-- 2018-06-05 16:30:00 2 --> <!-- 2018-06-05 16:35:00 3 --> <!-- 2018-06-05 16:40:00 4 --> <!-- ``` --> <!-- ist nicht klar, ob sich der Wert 1 auf den Zeitraum `16:20 - 16:25` oder auf --> <!-- den Zeitraum `16:25 - 16:30` bezieht. --> <!-- Dies ist ein Beispiel für eine sehr konkrete Metadatenanforderung. An dieser --> <!-- Stelle gehen wir meistens von Annahmen aus oder hat jemand von Euch schon mal --> <!-- eine Aussage darüber von einem BWB-Mitarbeiter erhalten? --> <!-- Die Kenntnis über den Bezugspunkt des Zeitstempels ist zum Beispiel wichtig beim --> <!-- * [Erzeugen von Regen-Inputdateien für die Software InfoWorks](#input-infoworks-regen) --> <!-- In der obigen Beispielzeitreihe ist auch nicht klar, auf welche Zeitzone sich --> <!-- die Zeitstempel beziehen. Die Zeitzone bzw. die Angabe des UTC-Offsets ist eine --> <!-- wichtige Metainformation. Am besten wäre es, wenn alle Zeitstempel diese --> <!-- Information enthielten, z.B. --> <!-- ``` --> <!-- 2018-06-05 16:20:00+02 --> <!-- 2018-06-05 16:25:00+02 --> <!-- ``` --> <!-- Dann bräuchte man diese Information nicht extra zu führen. --> <!-- Metadaten werden oft erst bei Bedarf erhoben, nämlich genau dann, wenn man sich --> <!-- z.B. die obigen Fragen nach Zeitstempelbezug und Zeitzone stellt. Dann wird --> <!-- nachgefragt und im besten Fall wird eine Antwort darauf gegeben, z.B. in Form --> <!-- einer E-Mail. Die Metadaten sind dann in der E-Mail enthalten. --> <!-- Fragen: --> <!-- - Wo und unter welchem Namen legen wir diese E-Mail ab? --> <!-- Beispiel für Metadaten aus meiner persönlichen Logdatei: --> <!-- > Regendaten gelesen; ca. 10 Impulse gegen 14:00; Dannecker geschrubbt, --> <!-- > neuer Akku: Batterie defect, alte Batterie drin gelassen, --> <!-- > Daten auslesen nicht möglich; H=55cm --> <!-- Das müssen wir irgendwie besser hinkriegen, aber wie? --> <!-- Die grundlegenden Fragen sind: --> <!-- - Wer ist für die Ablage von Metadaten verantwortlich? --> <!-- - Wo werden die Metadaten abgelegt? Direkt bei den Daten oder in einer --> <!-- übergeordneten Datei/Datenbank? --> <!-- - In welchem Format werden die Metadaten abgelegt? Einfaches Textformat vs. --> <!-- XML-Format, das sich wohl einfach nur mit Hilfe eines Programms editieren lässt. --> ### Special Metadata Files As stated earlier, we want to use consistent, unique identifiers to indicate the belonging of data to a certain project or data owner. We propose to define the identifiers in terms of special metadata files. #### Metadata File PROJECTS.txt and related files {#file-projects} The project identifiers are defined in a simple YAML file `PROJECTS.yml` in the `//server/projects$` folder. Only the identifiers defined in this file are expected to appear as top-level folder names in the [project folder structure](#project-folder-structure) within this network drive. Possible content of `PROJECTS.yml`:
flusshygiene: department: suw short-name: Flusshygiene long-name: > Hygienically relevant microorganisms and pathogens in multifunctional water bodies and hydrologic circles - Sustainable management of different types of water bodies in Germany financing: funder: bmbf sponsor: bwb ogre: department: suw short-name: OGRE long-name: > Relevance of trace organic substances in stormwater runoff of Berlin financing: funder: uep-2 sponsor: Veolia optiwells-2: department: grw short-name: OPTIWELLS 2 long-name: type: sponsored reliable-sewer: department: suw short-name: RELIABLE_SEWER long-name: type: contracted smartplant: department: wwt short-name: Smartplant type: sponsored ...
In the file `PROJECTS.yml` each entry represents a project. Each project is identified by its identifier. The project acronyms appear in alphabetical order. Each entry should at least contain information on the department, a short/long name or title of the project and the type of project (funded, sponsored, contracted). Additional information such as the year of the start of the project could be given. ```{block2 type = "rmdnote"} It could also be useful to define three letter codes for projects. These codes could e.g. be used in tables or diagrams in which different projects are compared and in which space may be limited.
The department acronyms could be defined in a file DEPARTMENTS.yml
:
grw: short-name: Groundwater head: Hella Schwarzmüller suw: short-name: Urban Systems head: Pascale Rouault wwt: short-name: Process Innovation head: Ulf Miehe
The funder acronyms could be defined in a file FUNDERS.yml
:
bmbf: German Federal Ministry of Education and Research (BMBF) uep-2: > Umweltentlastungsprogramm des Landes Berlin, co-financed by the European Union (UEP II)
It is very important to know the origin or owner of data. This is an important
piece of metadata. Therefore we define unique identifiers for the
owners of data that we use. The acronyms are defined in a special file
ORGANISATIONS.txt
. Possible content of this file:
bwb: Berliner Wasserbetriebe kwb: Kompetenzzentrum Wasser Berlin uba: Umweltbundesamt
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.