A logical and consistent folder structure, naming conventions, versioning rules, and the provision of metadata for your data files help you and your colleagues find and use the data. The following rules will save time in the long run and help avoid data duplication. With the following recommendations we intend to comply with `r citet("10.1371/journal.pcbi.1005097")`.
In general, data related to a project should be accessible to all persons involved in that project. Project-relevant files should not be stored permanently on local hard drives or on personal network drives. Instead, project data should be stored on network drives to which all project members have access.
A research institute aims at creating knowledge from data. A typical data flow may look like the following: raw data are received or collected from external sources or created from our own measurements. The raw data are processed, i.e. cleaned, aggregated, filtered and analysed. Finally, result data sets are composed, from which conclusions are drawn and published. In short: raw data get processed and become result data.
In computer science, the clear distinction between data input, data processing and data output is a well-known and widely used model to describe the structure of an information processing program. It is referred to as the input-process-output (IPO) model. We recommend applying this model to distinguish three categories of data:

- raw data (= input data),
- data in processing (= data processing) and
- result data (= output data).
Using the IPO model has the following benefits:

- It minimises the risk that files and folders are overwritten or deleted by automatic data processing (e.g. scripts).
- Raw data are protected against accidental overwriting.
- It helps to keep files and folders clearly organised.
- It reflects the roles and responsibilities of different project team members (e.g. the project manager is mainly interested in results).
- It helps avoid deep folder structures.
According to these three categories, we suggest creating three different areas, represented by three different network drives, on the top level of your file server: the first area for raw data, the second for data in processing and the third for result data.
Within each area, the data are organised by project first, i.e. each project is represented by one folder within each of the network drives:
```
//server/rawdata
  project-1/
  project-2/
  …
//server/processing
  project-1/
  project-2/
  …
//server/results
  project-1/
  project-2/
  …
```
The sub-folder structure within the project folders on each of these top-level network drives is described in the following sections.
As raw data we define data that we receive from a measurement device, a project partner or another external source (e.g. an internet download), even if these data were processed externally. Raw data are, for example, analytical measurements from laboratories, completed questionnaires from project partners, meteorological data from other institutes, measurements from loggers or sensors, or scans of hand-written sampling protocols. Especially in environmental sciences, raw data often cannot be reproduced at all, or only at high cost (e.g. rainfall or river discharge measurements). They are therefore of high value. Raw data are often large in terms of file size or number of files (e.g. measurements of devices logging at high temporal resolution). Raw data can already come in a complex, deep folder structure. Raw data are closely related to metadata, such as sensor configurations generated by loggers or email correspondence when receiving data by email from external partners. We acknowledge the high value of raw data by storing them in a dedicated, protected space and by requiring them to be accompanied by metadata.
Raw data are stored in an unmodified state. All modifications of the data are to be done on a copy of the raw data in the "processing" space (see below). The only modification allowed is the renaming of a raw data file, given that the renaming is documented in the metadata. Once stored, raw data are to be protected from being accidentally deleted or modified. This is achieved by making the raw data space write-protected ([link] FAQ: How to make a file write protected).
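As a minimal sketch of how such write protection could be applied from R (the path is an example; the FAQ linked above describes the procedure for your file system):

```r
# Minimal sketch (example path): mark all files below the raw data
# folder as read-only. Note that on Windows, Sys.chmod() only toggles
# the read-only attribute; access rights managed by the IT department
# on the file server are the more robust solution.
raw_files <- list.files("//server/rawdata", recursive = TRUE, full.names = TRUE)
Sys.chmod(raw_files, mode = "0444")
```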
We propose to organise raw data

- by project first and
- by origin (i.e. source or owner) of the data second.
We will create a network folder //server/rawdata in which all files have the read-only property set. We suggest storing raw data by project first and by the organisation that owns (i.e. generated or provided) the data second. This could look like this:
```
//server/rawdata
  ORGANISATIONS.txt
  PROJECTS.lnk [Symbolic Link to PROJECTS.txt in //server/projects$]
  test-project/
    bwb/
      rain/
        METADATA/
        rain.xls
      laboratory/
        METADATA/
        laboratory.xls
    kwb/
      discharge/
        METADATA/
        q01.csv
        q02.csv
        q03.csv
```
By data in processing we mean data at any stage of processing between raw and final, i.e. all intermediate results, such as different stages of data cleaning, different levels of data aggregation or different types of data visualisation.
We recommend storing data in these stages in their own space on the file system. This space is meant to be a "playground" where the researchers are asked to store all these intermediate results. It is where different approaches, models or scenarios can be tested and where, as a result, different versions of data are available. The data processing space is intended to be used for data only, not for e.g. documents, presentations or images. Compared to the raw data network drive, the data processing network drive is expected to require much more disk space.
In the data processing area, the files are stored by project first. Within each project, data may be organised by topic and/or data processing step.
```
//server/processing
  test-project/
    01_data-cleaning/
      METADATA/
      rain_raw.csv
      rain.csv
      quality.csv
      discharge.csv
    02_modelling/
      summer/
      winter/
      VERSIONS/
        v0.1/
        v1.0/
          summer/
          winter/
      software/
```
By result data we mean clean, aggregated, well-formatted data sets. Result data sets are the basis for interpretation or assessment and for the formulation of research findings. We consider all data that are relevant for the reporting of project results as result data. Result data will very often be spreadsheet data, but they can also comprise other types of data such as figures or diagrams.
We propose to prepare result data in the data processing area (see above) and to put symbolic links into the result data folder that point to the corresponding locations in the data processing folder. The idea is that the result area always gives a view onto the "best available" (intermediate) project results at a given point in time. Using symbolic links instead of file copies avoids accidental modification of data in the result data area; modifications are expected to happen in the data processing area only.
Often, result data sets are the result of temporal aggregation. They are consequently smaller in size than raw data sets. There will also be fewer result data sets than there are data sets representing different stages of data processing. For these reasons, the result data space is expected to require much less disk space than the spaces dedicated to raw data and data in processing.
The structure of the result data area should represent the project structure. It could, for example, be organised by work package. In that case, the folder names should contain not only the work package number but also the name of the work package.
```
//server/projects
  test-project/
    Data-Work Packages/
      wp-1_monitoring/
      wp-2_modelling/
        summer.lnk   # symbolic links to the last version
        winter.lnk   # in data processing
```
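How such links are created depends on the operating system. The following is a minimal sketch in R, using example paths from above; note that `file.symlink()` creates file-system symbolic links (which may require administrator rights on Windows), not Windows shell `.lnk` shortcuts:

```r
# Minimal sketch (example paths): link the current best version from
# the data processing area into the result data area.
file.symlink(
  from = "//server/processing/test-project/02_modelling/VERSIONS/v1.0/summer",
  to = "//server/projects/test-project/Data-Work Packages/wp-2_modelling/summer"
)
```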
In a project-driven research institute, almost all data processing steps are closely related to a specific research project. One project may, for example, require rain data to be prepared for being fed into a rainfall-runoff and sewer simulation software. Unprocessed (raw) rain data are received from a rain gauge station and cleaned. The clean rain data are then converted to the format that is required by the sewer simulation software.
In this example, the clean rain data are a data processing output. They are also the input to further processing and thus the source of even more valuable results.
The specific rain data file that is input to the sewer modelling software is the final (rain data) result. However, the clean rain data, which are only an intermediate result in the context of one project, are themselves already a valuable result. They can be used in other projects that require clean rain data for other purposes, e.g. for looking at climate change effects.
We recommend storing clean data sets in their own space. In this space, the data sets are organised by topic, not by project:
```
//server/treasure
  rain/
  flow/
  level/
```
This increases the visibility of existing clean data sets and reduces the risk that work that has already been done in one project is done again in another project. Often, people start again from the raw data even though somebody has already cleaned those data.
We recommend describing the meaning of subfolders in a file README.yaml in the folder that contains the subfolders. Example of such a README.yaml file:
```yaml
rlib:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    R library for packages needed for R training at BWB.
    To use the packages from that folder, use
    .libPaths(c(.libPaths(), "C:/_UserProgData/rlib"))

rlib_downloads:
  created-by: Hauke Sonnenberg
  created-on: 2019-04-05
  description: >
    files downloaded by install.packages(). Each file represents
    a package that is installed in the rlib folder.
```
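Such a README.yaml file is also machine-readable. A minimal sketch, assuming the `yaml` package is installed and using an example path:

```r
# Read the folder descriptions from a README.yaml file (example path)
# and look up the description of one subfolder.
library(yaml)
meta <- yaml::read_yaml("//server/rawdata/README.yaml")
meta$rlib$description
```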
Restrictions/Conventions:

- Each top-level folder should represent a project, i.e. it should be defined in the top-level file PROJECTS.txt.
- Each possible owner should be defined in the top-level file ORGANISATIONS.txt.
- The naming convention for the organisations is the same as for projects.
A concise and meaningful name for your files and folders is the key to finding your data again. Whenever you have the freedom to name your data files and structure your project folders, you should do so. Names should be concise and meaningful to you and your colleagues. A colleague who may not be familiar with the project should be able to guess the content of a folder or a file by intuition. Naming conventions are also necessary to avoid read errors during automatic data processing and to prevent errors when working on different operating systems.
Please comply with the following rules:
The following characters are authorized in file or folder names:

- upper case letters `A-Z`,
- lower case letters `a-z`,
- numbers `0-9`,
- underscore `_`,
- hyphen `-`,
- dot `.`

If you want to know why some characters are not authorized, please check the FAQ.
Instead of German umlauts and the sharp s (`ä`, `ö`, `ü`, `Ä`, `Ö`, `Ü`, `ß`), use the following substitutions: `ae`, `oe`, `ue`, `Ae`, `Oe`, `Ue`, `ss`.
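A minimal sketch of how these substitutions could be automated in R (the function name and the example file name are made up for illustration):

```r
# Apply the umlaut substitutions above to a file or folder name.
# substitute_umlauts() is a suggested helper, not an existing KWB tool.
substitute_umlauts <- function(x) {
  from <- c("ä", "ö", "ü", "Ä", "Ö", "Ü", "ß")
  to <- c("ae", "oe", "ue", "Ae", "Oe", "Ue", "ss")
  for (i in seq_along(from)) {
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

substitute_umlauts("Spülversuch_März.csv")
#> [1] "Spuelversuch_Maerz.csv"
```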
Please use underscore `_` or hyphen `-` instead of space.

Use underscore `_` to separate words that contain different types of information:

- `results_today` instead of `results-today`
- `protocol_hauke` instead of `protocol-hauke`
Use hyphen `-` instead of underscore `_` to visually separate the parts of compound words or names:

- `site-1` instead of `site_1`,
- `dissolved-oxygen` instead of `dissolved_oxygen`,
- `clean-data` instead of `clean_data`.

Use hyphen `-` (or no separation at all) in dates (e.g. `2018-07-02` or `20180702`).
Using hyphen instead of underscore in compound words ensures that the compound words are not split into their parts when a file or folder name is split at underscore. For example, splitting the name `project-report_example-project-1_v1.0_2018-07-02` at underscore results in the following words, each giving a different type of information on the file or folder (see the R sketch below):

- `project-report` (type of document),
- `example-project-1` (name of the related project),
- `v1.0` (version number),
- `2018-07-02` (version date).

From the pure data management point of view, it would be best not to use upper case letters in file or folder names at all. This would avoid possible conflicts when exchanging files between operating systems that either care about case in file names (e.g. Unix systems) or do not (e.g. Windows systems).
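The splitting at underscore referenced above could, for example, be done in R like this, using the example name:

```r
# Split a file name at underscore to recover the different types of
# information that it encodes.
name <- "project-report_example-project-1_v1.0_2018-07-02"
strsplit(name, "_")[[1]]
#> [1] "project-report"    "example-project-1" "v1.0"
#> [4] "2018-07-02"
```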
If upper case letters are allowed, it should be decided if and when to use capitals. With a corresponding rule in place, only one of the following spellings would, for example, be allowed:

- `dissolved-oxygen` (all lower case),
- `dissolved-Oxygen` (attributes lower case, nouns upper case),
- `Dissolved-oxygen` (first letter upper case),
- `Dissolved-Oxygen` (all parts of compound words upper case).
At least on Windows operating systems, very long file paths can cause trouble. When copying or moving a file to a target path that exceeds a length of 260 characters, an error will occur. This is particularly unfortunate when copying or moving many files at once and the process stops before completion. As the length of a file path mainly depends on the lengths of its components, we suggest restricting

- folder names to no more than 20 characters and
- file names to no more than 50 characters.

This would allow a file path to contain at most nine subfolder names. The maximum number of subfolders, i.e. the maximum folder depth, should be kept small by following the best practices described in Folder Structures. If folder or file names are generated by software (e.g. logger software, modelling software or a reference manager), please check whether the software allows the naming scheme to be modified. If we nevertheless have to deal with deeply nested folder structures and/or very long file or folder names, we should store them in a flat folder hierarchy, i.e. not in
```
\\server\projekte$\department-name\projects\project-name\
  data-work-packages\work-package-one-meaning-the-following\modelling\
  scenario-one-meaning-the-following\results
```
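A minimal sketch of how to spot over-long paths in R (the top-level folder is an example; note that very long paths may already cause problems for the listing itself on Windows):

```r
# List all files below a top-level folder and report paths that
# exceed the Windows limit of 260 characters.
paths <- list.files("//server/projects", recursive = TRUE, full.names = TRUE)
paths[nchar(paths) > 260]
```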
When adding date information to file names, please use one of these formats:

- `yyyy-mm-dd` (e.g. `2018-06-28`)
- `yyyymmdd` (e.g. `20180628`)
By doing so, file or folder names that differ only in the date will be displayed in chronological order. The first form improves the visual distinction of the year, month and day parts of the date. Using hyphen instead of underscore keeps these parts together when splitting the name at underscore (see above).
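A minimal sketch of how to generate such a date stamp in R (the file name pattern is an example):

```r
# Format today's date in the recommended yyyy-mm-dd form and use it
# as part of a file name.
stamp <- format(Sys.Date(), "%Y-%m-%d")
paste0("protocol_hauke_", stamp, ".txt")
#> e.g. "protocol_hauke_2019-04-05.txt"
```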
When using numbers in file or folder names to bring them into a certain order, use leading zeros as required to make all numbers used on one folder level have the same length. Otherwise, the names will not be displayed in the intended order in your file browser.
Example: `01`, `02`, `03`, etc. if there are 10 to 99 files/folders, or `001`, `002`, `003`, etc. if there are 100 to 999 files/folders.
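A minimal sketch of how to generate such zero-padded numbers in R (the name pattern is an example):

```r
# Pad numbers with leading zeros so that alphabetical order equals
# numerical order in the file browser.
sprintf("%02d_data-cleaning", c(1, 2, 10))
#> [1] "01_data-cleaning" "02_data-cleaning" "10_data-cleaning"
```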
We recommend defining sets of allowed words in so-called vocabularies. Only words from the vocabularies are then expected to appear in file or folder names. Getting accustomed to the words from the vocabularies and their meanings allows for more precise file searching. This is most important to clearly indicate that a file or folder relates to "special objects", such as projects, organisations or monitoring sites. At least for projects and organisations we want to define vocabularies in which "official" acronyms are defined for all projects and all organisations from which we expect to receive data (see the chapter on acronyms). Always using the acronyms defined in the vocabularies makes it possible to search for files or folders belonging to one specific project or provided by one specific organisation.
We could also define vocabularies of words describing other properties of a file or folder. For example, we could decide to always use `clean-data` instead of `data-clean`, `cleaned-data`, `data-cleaning`, `Datenbereinigung`, `bereinigte-daten`, and so on.
We could go one step further and define the order in which we expect the words to appear in a file or folder name. Which types of information should go first in the file name? The order of words determines how files are grouped visually when listed by name. If the acronym of the organisation goes first, files are grouped by organisation. If the acronym of the monitoring site goes first, files are grouped by monitoring site. Such rules cannot be set at a global level, i.e. for the whole company or even for a whole project. The requirements will differ depending on the types of information that are to be stored. We recommend defining naming conventions where appropriate and describing them in a metadata file in the folder below which the naming convention applies.
Do not mix words from different languages within one and the same file or folder name. For example, use `regen-ereignis` or `rain-event` instead of `regen-event` or `rain-ereignis`.
Within one project, use either only English or only German words in file or folder names. This restriction may be too strict. However, I think that we should follow this rule at least for the top-level folder structures. It is not nice to see the folders `AUFTRAEGE` (German) and `GROUNDWATER` (English) as folder names within the same parent folder.
Versioning, or version control, is the way in which different versions and drafts of a document (or file, record, dataset or software code) are managed. Versioning involves naming and distinguishing between a series of draft documents that lead to a final or approved version. Versioning "freezes" certain development steps and allows you to disclose an audit trail for the revision and update of drafts and final versions. It is essential for reproducing results that may be based on older data.
Manual versioning may cost more time and requires some discipline, but it ensures a clean and generally understandable file structure in the long term and provides a quick overview of the current status of development. Manual versioning does not require additional software (except a simple text editor) and is realised by following these simple guidelines:
- A version is created by copying the current file and pasting it to a subfolder named `VERSIONS`.
- Each successive draft of a file in the `VERSIONS` folder is numbered sequentially, e.g. `v0.1`, `v0.2`, `v0.3`, as a postfix at the end of the file name (e.g. `filename_v0.1`, `…_v0.2`, `…_v0.3`, and so on).
- Finalised forms (e.g. the presentation was held at a conference, the report was reviewed) receive a new major version number, e.g. `v1.0`, `v2.0`, and so on.
- Read-only is applied to each versioned file (to prevent accidental loss of final versions of files).
- Only files without a version postfix in their name are modified.
- A file `VERSIONS.txt` is created and kept up to date with a text editor, containing meta information on the purpose of each modification and the person who made it.
It is noteworthy that "final" does not necessarily mean ultimate. Final forms are subject to modification, and it is sometimes questionable whether a final status has been reached. Therefore, it is more important to be able to track the modifications in `VERSIONS.txt` than to argue about version numbers.
Example:

```
BestPractices_Workshop.ppt
VERSIONS/
  VERSIONS.txt
  BestPractices_Workshop_v0.1.ppt
  BestPractices_Workshop_v0.2.ppt
  BestPractices_Workshop_v1.0.ppt
```
Content of file `VERSIONS.txt`:

```
BestPractices_Workshop.ppt
- v1.0: first final version, after review by NAME
- v0.2: after additions by NAME
- v0.1: first draft version, written by NAME
```
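The manual steps above can also be scripted. A minimal sketch in R, using the example file above (the version number is chosen for illustration):

```r
# Copy the working file into the VERSIONS subfolder with a version
# postfix and write-protect the copy.
versioned <- "VERSIONS/BestPractices_Workshop_v0.2.ppt"
file.copy("BestPractices_Workshop.ppt", versioned)
Sys.chmod(versioned, mode = "0444")
```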
Automatic versioning is mandatory in case of programming. The versioning is done automatically when version control software such as Git or Subversion is used.
At KWB we currently use the following version control software:
- Subversion: for internally storing program code (e.g. R scripts/packages) we have a Subversion server, which is accessible from the KWB intranet. However, this requires:
  - the installation of the client software TortoiseSVN and
  - a valid user account (for accessing the server), which is currently provided by the IT department on request.
- Git: for publishing program code (e.g. R packages) externally in our KWB organisation group on GitHub. Currently all repositories are public (i.e. visible to everyone), but private repositories can also be used for free, as GitHub recognises KWB as a non-profit company and offers additional benefits free of charge.
```{block2, type = 'rmdcaution'}
Use of version control software is required in case of programming (e.g. in R, Python, and so on) and can be useful for tracking changes in small text files (e.g. configuration files that run a specific R script with different parameters for scenario analysis).

**Drawbacks:**

* Special software ([TortoiseSVN](https://tortoisesvn.net/index.de.html)), login data for each user on the KWB server and some basic training are required.
* In case of collaborative coding, sticking to best practices for using version control is mandatory, e.g.:
    + timely check-in of code changes to the central server,
    + speaking to each other, so that two people do not work on the same program code in one script at the same time, as this leads to conflicts that need to be resolved manually, which can be quite time-consuming. You are much better off if you avoid this upfront by talking to each other.

**Advantages:**

* Only one file name per script (file history and code changes are managed either internally on a KWB server in case of TortoiseSVN or externally for code hosted on GitHub).
* Old versions of scripts can be restored easily.
* Additional comments during `commit` (i.e. at the time of transferring the code from the local computer to the central version control system) about *why* code changes were made, together with built-in diff tools for tracking changes, improve reproducibility.
```

```{block2, type = 'rmdwarning'}
Attention: version control software is not designed for the versioning of raw data and thus should not be used for it. General thoughts on the topic of 'data versioning' are available here: [https://github.com/leeper/data-versioning](https://github.com/leeper/data-versioning)
```

```{block2, type = 'rmdnote'}
A presentation on different tools for version control is available here: https://www.fosteropenscience.eu/node/597
```