# Best Practices

A logical and consistent folder structure, naming conventions, versioning rules and the provision of metadata for your data files help you and your colleagues find and use the data.

The following rules will save time in the long run and will help avoid data duplication. With the following recommendations we intend to comply with `r citet("10.1371/journal.pcbi.1005097")`.

In general, data related to one project should be accessible to everyone involved in the project. Project-relevant files should not be permanently stored on local hard drives or on personal network drives. Project data should therefore be stored on network drives to which all project members have access.

## Folder Structure

A research institute aims at creating knowledge from data. A typical data flow may look like the following: raw data are received or collected from external sources or created from our own measurements. The raw data are processed, i.e. cleaned, aggregated, filtered and analysed. Finally, result data sets are composed, from which conclusions are drawn and published. In short: raw data get processed and become result data.

In computer science, the clear distinction between data input, data processing and data output is a well-known and widely used model to describe the structure of an information processing program. It is referred to as the input-process-output (IPO) model. We recommend applying this model to distinguish three categories of data: raw data (input), data in processing (process) and result data (output).

Using the IPO model has the following benefits:

According to the three categories, we suggest creating three different areas, represented by three different network drives, on the top level of your file server: the first area is for raw data, the second for data in processing and the third for result data.

Within each area, the data are organised by project first, i.e. each project is represented by one folder within each of the network drives:

//server/rawdata
    project-1/
    project-2/
    …

//server/processing
    project-1/
    project-2/
    …

//server/results
    project-1/
    project-2/
    …

The sub-folder structures within the project folders of these top-level network drives are described in the following sections.

### Raw data {-}

As raw data we define data that we receive from a measurement device, a project partner or another external source (e.g. an internet download), even if these data were processed externally. Raw data are, for example, analytical measurements from laboratories, filled-out questionnaires from project partners, meteorological data from other institutes, measurements from loggers or sensors, or scans of handwritten manual sampling protocols. Especially in environmental sciences, raw data often cannot be reproduced at all, or only at high cost (e.g. rainfall or river discharge measurements). They are therefore of high value.

Raw data are often large in terms of file size or file number (e.g. measurements of devices logging at high temporal resolution), and they can already come in a complex, deep folder structure. Raw data are closely related to metadata, such as sensor configurations generated by loggers or email correspondence when data are received by email from external partners. We acknowledge the high value of raw data by storing them in a dedicated, protected space and by requiring them to be accompanied by metadata.

Raw data are stored in an unmodified state. All modifications of the data are to be done on a copy of the raw data in the “processing” space (see below). The only modification allowed is the renaming of a raw data file, provided that the renaming is documented in the metadata. Once stored, raw data are to be protected from being accidentally deleted or modified. This is achieved by making the raw data space write-protected ([link] FAQ: How to make a file write protected).
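As a minimal sketch, the write protection could be set from R; the folder path follows the examples below, and in practice the permissions of a network share are usually managed centrally by IT rather than by individual users:

```r
# Sketch: set the read-only flag on all files below the raw data folder.
# The path is an assumption taken from the examples in this chapter.
raw_files <- list.files("//server/rawdata", recursive = TRUE, full.names = TRUE)
Sys.chmod(raw_files, mode = "0444", use_umask = FALSE)
```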

We propose to organise raw data by project first and, within each project, by the organisation that owns (i.e. generated or provided) the data. We will create a network folder //server/rawdata in which all files have the read-only property set. This could look like this:

//server/rawdata

  ORGANISATIONS.txt

  PROJECTS.lnk [Symbolic Link to PROJECTS.txt in //server/projects$]

  test-project/
    bwb/
      rain/
        METADATA/
        rain.xls
      laboratory/
        METADATA/
        laboratory.xls

    kwb/
      discharge/
        METADATA/
        q01.csv
        q02.csv
        q03.csv
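As a sketch, such a folder skeleton could be created from R; the helper function create_project_skeleton() and the server paths are only illustrative:

```r
# Hypothetical helper: create the folder skeleton for a new project in all
# three areas (raw data, processing, results).
create_project_skeleton <- function(project, areas = c(
  "//server/rawdata", "//server/processing", "//server/results"
)) {
  for (area in areas) {
    dir.create(file.path(area, project), recursive = TRUE, showWarnings = FALSE)
  }
}

create_project_skeleton("test-project")
```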

### Data Processing {-}

By data in processing we mean data in any stage of processing between raw and final, i.e. all intermediate results, such as different stages of data cleaning, different levels of data aggregation or different types of data visualisation.

We recommend storing data in these stages in their own space on the file system. This space is meant to be a “playground” where researchers store all these intermediate results. It is the space where different approaches, models or scenarios can be tested and where, as a result, different versions of data are available. The data processing space is intended to be used for data only, not for e.g. documents, presentations or images.

Compared to the raw data network drive, the data processing network drive is expected to require much more disk space.

In the data processing area the files are stored by project first. Within each project data may be organised by topic and/or data processing step.

//server/processing

test-project/
  01_data-cleaning
    METADATA
    rain_raw.csv
    rain.csv
    quality.csv      
    discharge.csv
  02_modelling
    summer
    winter
    VERSIONS
      v0.1
      v1.0
        summer
        winter
  software

### Result Data {-}

By result data we mean clean, aggregated, well-formatted data sets. Result data sets are the basis for interpretation or assessment and for the formulation of research findings. We consider all data that are relevant for reporting project results as result data. Result data will very often be spreadsheet data, but they can also comprise other types of data, such as figures or diagrams. We propose to prepare result data in the data processing area (see above) and to put symbolic links into the result data folder that point to the corresponding locations in the data processing area. The idea is that the result area always provides a view onto the “best available” (intermediate) project results at a given point in time. Using symbolic links instead of file copies avoids accidental modification of data in the result data area; modifications are expected to happen in the data processing area only.

Often, result data sets are the result of temporal aggregation. They are consequently smaller in size than raw data sets. There will also be fewer result data sets than there are data sets representing different stages of data processing. For these reasons, the result data space is expected to require much less disk space than the spaces dedicated to raw data and data in processing.

The structure in the result data area should represent the project structure. It could, for example, be organised by work package. When organising by work package, the folder names should not only contain the work package number but also indicate the name of the work package.

//server/results

  test-project/
    data-work-packages
      wp-1_monitoring
      wp-2_modelling
        summer.lnk # symbolic links to last version 
        winter.lnk # in data processing
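As a sketch, such a link could be created from R with file.symlink(); the paths follow the examples above, and creating symbolic links requires support by the file system and, on Windows, appropriate user privileges:

```r
# Sketch: expose the latest model results from the processing area in the
# result area via a symbolic link (paths are illustrative examples).
file.symlink(
  from = "//server/processing/test-project/02_modelling/VERSIONS/v1.0/summer",
  to = "//server/results/test-project/data-work-packages/wp-2_modelling/summer.lnk"
)
```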

### Clean Datasets {-}

In a project-driven research institute, almost all data processing steps are closely related to a specific research project. One project may, for example, require preparing rain data to be fed into rainfall-runoff and sewer simulation software. Unprocessed (raw) rain data are received from a rain gauge station and cleaned. The clean rain data are then converted to the format required by the sewer simulation software.

In this example, the clean rain data are a data processing output. They are also the input to further processing and thus the source of even more valuable results.

The specific rain data file that is input to the sewer modelling software is the final (rain data) result. However, the clean rain data that are an intermediate result in the context of one project are themselves already valuable results. They can be used in other projects that require clean rain data for other purposes, e.g. for looking at climate change effects.

We recommend storing clean datasets in their own space. In this space, the datasets are organised by topic, not by project:

//server/treasure
  rain/
  flow/
  level/

This increases the visibility of existing clean datasets and reduces the risk that work that has already been done in one project is done again in another project. Often, people start again from the raw data even though somebody has already cleaned them.

### Metadata {-#folder-structure-metadata}

We recommend describing the meaning of subfolders in a README.yaml file placed in the folder that contains the subfolders.

An example of such a README.yaml file:

rlib:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    R library for packages needed for R training at BWB.
    To use the packages from that folder, use 
    .libPaths(c(.libPaths(), "C:/_UserProgData/rlib"))

rlib_downloads:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    files downloaded by install.packages(). Each file represents a package
    that is installed in the rlib folder.
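Such a README.yaml file can also be read programmatically, e.g. from R, assuming the yaml package (not part of base R) is installed:

```r
# Sketch: read the folder descriptions from a README.yaml file
meta <- yaml::read_yaml("README.yaml")

names(meta)             # described subfolders: "rlib", "rlib_downloads"
meta$rlib$description   # free-text description of the "rlib" subfolder
```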

Restrictions/Conventions:

## Naming of Files and Folders {#data-storage-naming}

Concise and meaningful names for your files and folders are the key to finding your data again. Whenever you have the freedom to name your data files and to structure your project folders, you should do so. Names should be concise and meaningful to you and your colleagues. A colleague who may not be familiar with the project should be able to guess the content of a folder or a file intuitively. Naming conventions are also necessary to avoid read errors during automatic data processing and to prevent errors when working on different operating systems.

Please comply with the following rules:

### Rule A: Allowed Characters {-}

The following set of characters is allowed in file or folder names:

If you want to know why some characters are not allowed, please check the FAQ:

Instead of German umlauts and the sharp s (ä, ö, ü, Ä, Ö, Ü, ß) use the following substitutions: ae, oe, ue, Ae, Oe, Ue, ss.
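A small R sketch of such a substitution; the helper function substitute_umlauts() is hypothetical:

```r
# Hypothetical helper: substitute German umlauts and the sharp s in names.
# gsub() calls are chained because each single character maps to two.
substitute_umlauts <- function(x) {
  substitutions <- c(
    "ä" = "ae", "ö" = "oe", "ü" = "ue",
    "Ä" = "Ae", "Ö" = "Oe", "Ü" = "Ue", "ß" = "ss"
  )
  for (pattern in names(substitutions)) {
    x <- gsub(pattern, substitutions[[pattern]], x, fixed = TRUE)
  }
  x
}

substitute_umlauts("Messung_Müggelsee.csv")  # "Messung_Mueggelsee.csv"
```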

### Rule B: Separation of Words or Parts of Words {-}

Please use underscore _ or hyphen - instead of space. Use underscore _ to separate words that contain different types of information:

Use hyphen - instead of underscore _ to visually separate the parts of compound words or names:

Use hyphen - (or no separation at all) in dates (i.e. 2018-07-02 or 20180702).

Using hyphen instead of underscore within compound words ensures that the compound words are not split into their parts when a file or folder name is split at underscore.

For example, splitting the name project-report_example-project-1_v1.0_2018-07-02 at underscore results in the following words (each giving a different type of information on the file or folder):

* project-report (type of document)
* example-project-1 (project name)
* v1.0 (version number)
* 2018-07-02 (date)
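In R, for example, such a name can be split with strsplit():

```r
# Split a file name at underscore to recover the types of information
strsplit("project-report_example-project-1_v1.0_2018-07-02", "_")[[1]]
# [1] "project-report"    "example-project-1" "v1.0"              "2018-07-02"
```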

### Rule C: Capitalisation {-}

From a pure data management point of view it would be best not to use upper case letters in file or folder names at all. This would avoid possible conflicts when exchanging files between operating systems that either care about case in file names (e.g. Unix systems) or do not (e.g. Windows systems).

If upper case letters are allowed, it should be decided if and when to use capitals. With such a rule in place, only one of the following spellings would, for example, be allowed:

### Rule D: Avoid Long Names {-}

At least on Windows operating systems, very long file paths can cause trouble. When copying or moving a file to a target path that exceeds a length of 260 characters, an error will occur. This is particularly unfortunate when copying or moving many files at once and the process stops before completion. As the length of a file path mainly depends on the lengths of its components, we suggest restricting:

This would allow a file path to contain at maximum nine subfolder names. The maximum number of subfolders, i.e. the maximum folder depth, should be kept small by following the best practices described in Folder Structure. If folder or file names are generated by software (e.g. logger software, modelling software or a reference manager), please check whether the software allows modifying the naming scheme. If we nevertheless have to deal with deeply nested folder structures and/or very long file or folder names, we should store them in a flat folder hierarchy, i.e. not in

\\server\projekte$\department-name\projects\project-name\
  data-work-packages\work-package-one-meaning-the-following\modelling\
  scenario-one-meaning-the-following\results.
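As a sketch, paths approaching the limit could be detected from R before copying or moving; the root folder and the warning threshold of 200 characters are assumptions:

```r
# Sketch: list paths that come close to the 260 character limit
paths <- list.files("//server/projects", recursive = TRUE, full.names = TRUE)
long_paths <- paths[nchar(paths) > 200]
long_paths[order(nchar(long_paths), decreasing = TRUE)]
```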

### Rule E: Formatting of Dates and Numbers {-}

When adding date information to file names, please use one of these formats: yyyy-mm-dd (e.g. 2018-07-02) or yyyymmdd (e.g. 20180702).

By doing so, file or folder names that differ only in the date will be displayed in chronological order. Using the first form improves the visual distinction of the year, month and day parts of the date. Using hyphen instead of underscore keeps these parts together when splitting the name at underscore (see above).

When using numbers in file or folder names to bring them into a certain order, use leading zeros as required to make all numbers used in one folder level have the same length. Otherwise the names will not be displayed in numerical order in your file browser.

Example: without leading zeros, a file browser sorts the names 1, 10, 11, 2 in this (alphabetical) order instead of 01, 02, 10, 11.
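In R, for example, zero-padded numbers can be generated with sprintf():

```r
# Zero-padding makes alphabetical and numerical sort order agree
sprintf("%02d_scenario", c(1, 2, 10))
# [1] "01_scenario" "02_scenario" "10_scenario"
```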

### Rule F: Allowed Words {-}

We recommend defining sets of allowed words in so-called vocabularies. Only words from the vocabularies are then expected to appear in file or folder names. Getting accustomed to the words from the vocabularies and their meanings allows for more precise file searching. This is most important to clearly indicate that a file or folder relates to "special objects", such as projects, organisations or monitoring sites. At least for projects and organisations we want to define vocabularies in which "official" acronyms are defined for all projects and all organisations from which we expect to receive data (see the chapter on acronyms). Always using the acronyms defined in the vocabularies allows searching for files or folders belonging to one specific project or provided by one specific organisation.

We could also define vocabularies of words describing other properties of a file or folder. We could e.g. decide to always use clean-data instead of data-clean, cleaned-data, data-cleaning, Datenbereinigung, bereinigte-daten, and so on.
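As a sketch, file names could be checked against such a vocabulary from R; the vocabulary content here is a made-up example:

```r
# Sketch: report words of a file name that are not in the vocabulary
vocabulary <- c("test-project", "rain", "clean-data", "v1.0")
words <- strsplit("test-project_rain_clean-data_v2.0", "_")[[1]]
words[! words %in% vocabulary]
# [1] "v2.0"  -> not (yet) an allowed word
```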

### Rule G: Order of Words {-}

We could go one step further and define the order in which words are expected to appear in a file or folder name. Which types of information should go first in the file name? The order of words determines how files are grouped visually when being listed by name. If the acronym of the organisation goes first, files are grouped by organisation. If the acronym of the monitoring site goes first, files are grouped by monitoring site. Such rules cannot be set on a global level, i.e. for the whole company or even for a whole project. The requirements will differ depending on the types of information that are to be stored. We recommend defining naming conventions where appropriate and describing them in a metadata file in the folder below which the naming convention applies.

### Rule H: Allowed Languages {-}

Do not mix words from different languages within one and the same file or folder name. For example, use regen-ereignis or rain-event instead of regen-event or rain-ereignis.

Within one project, use either only English words or only German words in file or folder names. This restriction may be too strict; however, we should follow this rule at least for the top-level folder structures. It is not nice to see the folders AUFTRAEGE (German) and GROUNDWATER (English) as folder names within the same parent folder.

## Versioning

Versioning, or version control, is the way in which different versions and drafts of a document (or file, record, dataset or software code) are managed. Versioning involves naming and distinguishing between a series of draft documents that lead to a final or approved version in the end. Versioning "freezes" certain development steps and allows you to disclose an audit trail for the revision and update of drafts and final versions. It is essential for reproducing results that may be based on older data.

### Manual {-}

Manual versioning may cost more time and requires some discipline, but it ensures a clean and generally understandable file structure in the long term and provides a quick overview of the current status of development. Manual versioning does not require additional software (except a simple text editor) and is realised by following these simple guidelines:

It is noteworthy that “final” does not necessarily mean ultimate. Final versions are subject to modification and it is sometimes questionable whether a final status has been reached at all. Therefore, it is more important to be able to track the modifications in VERSIONS.txt than to argue about version numbers.

Example:

BestPractices_Workshop.ppt
VERSIONS/
  + VERSIONS.txt
  + BestPractices_Workshop_v0.1.ppt
  + BestPractices_Workshop_v0.2.ppt
  + BestPractices_Workshop_v1.0.ppt

Content of file VERSIONS.txt:

BestPractices_Workshop.ppt
- v1.0: first final version, after review by NAME
- v0.2: after additions by NAME
- v0.1: first draft version, written by NAME
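Freezing a version and updating VERSIONS.txt could also be supported by a small script. A hypothetical R helper, assuming the conventions shown above (VERSIONS/ subfolder, _vX.Y name suffix, VERSIONS.txt log):

```r
# Hypothetical helper: "freeze" the current state of a file as a new version
freeze_version <- function(file, version, comment) {
  dir.create("VERSIONS", showWarnings = FALSE)
  name <- tools::file_path_sans_ext(basename(file))
  extension <- tools::file_ext(file)
  file.copy(file, file.path(
    "VERSIONS", sprintf("%s_v%s.%s", name, version, extension)
  ))
  cat(
    sprintf("- v%s: %s\n", version, comment),
    file = file.path("VERSIONS", "VERSIONS.txt"), append = TRUE
  )
}

freeze_version("BestPractices_Workshop.ppt", "0.2", "after additions by NAME")
```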

### Automatic {-}

Versioning is done automatically when version control software such as Git or Subversion is used. In case of programming, the use of such software is mandatory.

At KWB we currently use the following version control software: Subversion (with the TortoiseSVN client, for code hosted on KWB servers) and Git (for code hosted on GitHub).

```{block2, type = 'rmdcaution'}
Use of version control software is required in case of programming (e.g. in R, Python, and so on) and can be useful for tracking changes in small text files (e.g. configuration files that run a specific R script with different parameters for scenario analysis).
```

**Drawbacks:**

* Special software ([TortoiseSVN](https://tortoisesvn.net/index.de.html)), login data for each user on the KWB server and some basic training are required

* In case of collaborative coding, sticking to best practices for using version control is mandatory, e.g.:

    + timely check-in of code changes to the central server,

    + speaking to each other, so that two people do not work on the same program code in one script at the same time, as this leads to conflicts that need to be resolved manually, which can be quite time-consuming. You are much better off avoiding this up front by talking to each other.

**Advantages:**

* Only one file name per script (file history and code changes are managed either internally on a KWB server when using TortoiseSVN, or externally for code hosted on GitHub)

* Old versions of scripts can be restored easily 

* Additional comments during `commit` (i.e. at the time of transferring the code from the local computer to the central version control system) about *why* code changes were made, as well as built-in diff tools for tracking changes, improve reproducibility

```{block2, type = 'rmdwarning'}
Attention: version control software is not designed for the versioning of raw data and thus should not be used for it. General thoughts on the topic of 'data versioning' are available here: [https://github.com/leeper/data-versioning](https://github.com/leeper/data-versioning)
```

```{block2, type = 'rmdnote'}

A presentation with different tools for version control is available here: https://www.fosteropenscience.eu/node/597

```


