In KWB-R/fakin.doc: Does not matter.

Data Publishing and Sharing

knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)

```r', fig.link='https://doi.org/10.1371/journal.pcbi.1005278/'} knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)

```r
knitr::include_graphics("images/tensimplerules_datasharing.png", dpi = NA)

"This figure provides a framework for understanding how the “Ten Simple Rules to Enable Multi-site Collaborations through Data Sharing” r citep("10.1371/journal.pcbi.1005278") can be translated into easily understood modern life concepts.

Rule 1 is Open-Source Software. The openness is signified by a window to a room filled with algorithms that are represented by gears.

Rule 2 involves making the source data available whenever possible. Source data can be very useful for researchers. However, data are often housed in institutions and are not publicly accessible. These files are often stored externally; therefore, we depict this as a shed or storehouse of data, which, if possible, should be provided to research collaborators.

Rule 3 is to “use multiple platforms to share research products.” This increases thechances that other researchers will find and be able to utilize your research product—this is represented by multiple locations (i.e., shed and house).

Rule 4 involves the need to secure all necessary permissions a priori. Many datasets have data use agreements that restrict usage. These restrictions can sometimes prevent researchers from performing certain types of analyses or publishing in certain journals (e.g., journals that require all data to be openly accessible); therefore, we represent this rule as a key that can lock or unlock the door of your research.

Rule 5 discusses the privacy issues that surround source data. Researchers need to understand what they can and cannot do (i.e., the privacy rules) with their data. Privacy often requires allowing certain users to have access to sections of data while restricting access to other sections of data. Researchers need to understand what can and cannot be revealed about their data (i.e., when to open and close the curtains).

Rule 6 is to facilitate reproducibility whenever possible. Since communication is the forte of reproducibility, we depicted it as two researchers sharing a giant scroll, because data documentation is required and is often substantial.

Rule 7 is to “think global.” We conceptualize this as a cloud. This cloud allows the research property (i.e., the house and shed) to be accessed across large distances.

Rule 8 is to publicize your work. Think of it as “shouting from the rooftops.” Publicizing is critical for enabling other researchers to access your research product.

Rule 9 is to “stay realistic.” It is important for researchers to “stay grounded” and resist the urge to overstate the claims made by their research.

Rule 10 is to be engaged, and this is depicted as a person waving an “I heart research” sign. It is vitally important to stay engaged and enthusiastic about one’s research. This enables you to draw others to care about your research."

---- r citep("10.1371/journal.pcbi.1005278")

Recommended literature:

Ten simple rules to enable multi-site collaborations through data sharing r citep("10.1371/journal.pcbi.1005278")
Guidelines for publishing (PhD) research data r citep(manual["Kaden_2018"])

Repositories

Repositories for permanently deposing data are for example:

General
- Figshare,
- Zenodo (a joint project between OpenAIRE and CERN),
- Mendeley data,
- Dataverse,
- Dryad
Focus on environmental and earth sciences
- Pangea
- GFZ Potsdam data services

Repositories for publishing program code are:

Github or
Gitlab.

However, both do not offer long term data preservation by default, but using Github it is posible to make the code citable by linking it with Zenodo (see: https://guides.github.com/activities/citable-code/).

We are currently using the following three repositories for publishing program code (mainly R packages):

Github: for developing and publishing program code (mainly R packages) we use https://github.com/kwb-r. Currently 81 (i.e. 38 public and 43 private) repositories are published on this Github account. For all 32 public R packages there is also a detailed status report available available at https://kwb-r.github.io/status/ , e.g. with information on license, documentation and the "health" of the R package (i.e. whether it can be successfully installed on Linux or Windows platforms).
Zenodo: for automatically getting a DOI for each software release made in one of our public Github repositories, e.g. aquanes.report (for details see: https://guides.github.com/activities/citable-code/) and
Gitlab: as backup mirror (https://gitlab.com/kwb-r) for all of our currently 81 (i.e. 34 public and 47 private) repositories currently published on our Github account (https://github.com/kwb-r)

Proposal: define company-wide QMS policy ("top-down") for publishing program code

The above workflow was established from "bottom-up" (i.e. Michael Rustler and Hauke Sonnenberg) with the idea in mind to make the code as open as possible (e.g. by chossing the permisse MIT license as default for all of our public R packages).

However, up to now there is no company wide strategy ("top-down") defined yet that would legitimate this "bottom-up" approach. This creates uncertainty (e.g. what can be published?), so that much more code than necessary is labelled as "private". To reduce this uncertainty the following QMS policy is proposed, which should be discussed and agreed on in one of the next KWB management meetings:

Sponsor projects (e.g. funded by BMBF, EU): source code will be published by default at https://github.com/kwb-r in public repositories (i.e. it will be accessible for everyone) under the permissive MIT license in case that the source code does not:
contain security critical paths (e.g. to our company server) or
confidential data.

Code should be developed in such a way that both of the criteria (security critical paths, confidential data) defined above are considered. Making the code openly available will decrease the burden to install them (e.g. not each student needs to get an "access" token to install private repositories, as required for "contract" projects, see below).

Contract projects (e.g. funded by BWB, Veolia): will be published in private repositories by default at https://github.com/kwb-r in case the funder does not pre-define a specific repository. Access to the source code is thus resticted to KWB researchers and students working in the contract project. Project partners and funders can access the source code only if they get an "access token" from the KWB project team.

```{block2 type = "rmdtip"} A blog post by @Bosman_2016 provide results of a large survey carried out in 2015 among more than 15000 researchers. Insights can be gained on:

Which scholary communications tools are used and
Are there disciplinary differences in usage?

They finally summarise: "Another surprising finding is the overall low use of Zenodo – a CERN-hosted repository that is the recommended archiving and sharing solution for data from EU-projects and -institutions. The fact that Zenodo is a data-sharing platform that is available to anyone (thus not just for EU project data) might not be widely known yet."

### ORCID

Problem:

>"Two large challenges that researchers face today are discovery and evaluation. 
We are overwhelmed by the volume of new research works, and traditional discovery 
tools are no longer sufficient. We are spending considerable amounts of time 
optimizing the impact—and discoverability—of our research work so as to support 
grant applications and promotions, and the traditional measures for this are not 
enough. 
> --- `r citep(manual["Fenner2014"])`

Solution:

>"Open Researcher & Contributor ID ([ORCID](http://orcid.org/)) is an international, interdisciplinary, open and not-for-profit organization created to solve the 
researcher name ambiguity problem for the benefit of all stakeholders. 
[ORCID](http://orcid.org/)was built with the goal of becoming the universally 
accepted unique identifier for researchers: 
>
>1. ORCID is a community-driven organization
>
>2. ORCID is not limited by discipline, institution, or geography
>
>3. ORCID is an inclusive and transparently governed not-for profit organization
>
>4. ORCID data and source code are available under recognized open licenses
>
>5. the ORCID iD is part of institutional, publisher, and funding agency 
infrastructures.
>
>Furthermore, [ORCID](http://orcid.org/) recognizes that existing researcher and 
identifier schemes serve specific communities, and is working to link with, 
rather than replace, existing infrastructures."
>
> --- `r citep(manual["Fenner2014"])`


### Licenses

>"In most countries in the world, creative work is protected by copyright laws.
International conventions, and primarily the Berne Convention of 1886, protect
the copyright of creators even across international borders for 50 years after
the death of the creator. This means that copying and using the creative work is
limited by conditions set by the creator, or another copyright holder. For
example, in many cases musical recordings may not be copied and further
distributed without the permission of the musician, or of the production company
that has acquired the copyright from the musician. Facts about the universe that
are discovered through research are not subject to copyright, but the
collection, aggregation, analysis and interpretation of research data may be
considered creative work, and could be protected by copyright laws. Thus, the
consumption of research publications is governed by copyright law. Furthermore,
even data sharing is often governed by copyright laws, because the compilation
of data to be shared often requires a creative effort. Another case of
resarch-relevant copyrighted products is software that is developed in the
course of research. In all of these cases, if license terms are not explicitly
specified, the work is considered to be protected as "all rights reserved". This
means that no one but the creator of the work can use the work unencumbered. For
software this means that copying and further distribution of the software is
prohibited. Even running the software may be restricted. The exact selection of
a license is beyond the scope of this section, but depends on your intentions
and goals with regard to the software"
>
> --- `r citep(manual["Rokem_2018"])`

Recommended literature: 

- [Intellectual Property and Computational Science](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Stodden2014"])`

- [forschungslizenzen.de](http://www.forschungslizenzen.de) `r citep(manual["Forschungslizenzen"])`

- [Creative Commons Licences](https://link.springer.com/chapter/10.1007/978-3-319-00026-8_19) `r citep(manual["Friesike2014"])`

- [choosealicense.com/](https://choosealicense.com/)


### File Formats 

>"Scientific data is saved in a myriad of file formats. A typical file format
might include a file header, describing the layout of the data on disk, metadata
associated with the data, and the data itself, often stored in binary format. In
some cases (e.g., CSV (or comma-separated value) files), data will be stored as
text. The danger of proliferation of file formats in scientific data lies in the
need to build and maintain separate software tools to read, write and process
all these data formats. This makes interoperability between different
practitioners more difficult, and limits the value of data sharing, because
access to the data in the files remains limited."
>
> --- `r citep(manual["Rokem_2018"])`


```r

if (!require("tibble")) {
  install.packages("tibble")
}

tbl_longterm_file_formats <- tibble::tribble(
  ~., ~More.than.ten.years, ~Up.to.ten.years, ~Not.suitable,
  "Text", "PDF/A, TXT, ASC, XML", "PDF, RTF, HTML, DOCX, PPTX, ODT, LATEX", "DOC, PPT",
  "Data", "CSV", "XLSX, ODS", "XLS",
  "Pictures", "TIFF, PNG, JPG 2000, SVG", "GIF, BMP, JPEG", "INDD, EPS",
  "Audio", "WAV", "MP3, MP4", "",
  "Video", "Motion JPG 2000, MOV", "MP4", "WMV"
)

names(tbl_longterm_file_formats) <- c(
  "", "More than ten years", "Up to ten years",
  "Not suitable"
)

if (!require("kableExtra")) {
  install.packages("kableExtra")
}
library(kableExtra, quietly = TRUE)

knitr::kable(tbl_longterm_file_formats,
  format = "html",
  align = "c",
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
) %>%
  kableExtra::column_spec(2:4, width = "5cm")

kableExtra::column_spec(
  kable_input = knitr::kable(tbl_longterm_file_formats, 
    "latex",
    align = "c",
    booktabs = T,
    caption = "Suitability of file formats for long-term preservation \\citep{Kaden_2018}"
  ),
  column = 2:4, width = "4cm"
)

knitr::kable(tbl_longterm_file_formats,
             format = "markdown",
             align = "c", 
  caption = "Suitability of file formats for long-term preservation [@Kaden_2018]"
)

Data Exchange Standards

WaterML2:

"...is a new data exchange standard in Hydrology which can basically be used to exchange many kinds of hydro-meteorological observations and measurements. WaterML2 has been initiated and designed over a period of several years by a group of major national and international organizations from public and private sector, such as CSIRO, CUAHSI, USGS, BOM, NOAA, KISTERS and others. WaterML2 has been developed within the OGC Hydrology Domain Working group which has a mandate by the WMO, too."

--- WaterML2

ODM2: is an information model and supporting software ecosystem for feature-based earth observations

KWB-R/fakin.doc documentation built on Sept. 27, 2019, 9:53 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com