In cjvanlissa/worcs: Workflow for Open Reproducible Code in Science

library("papaja")
mask <- TRUE
masked <- function(text){
  ifelse(mask, "[masked]", text)
}
knitr::opts_chunk$set(echo = FALSE, eval = TRUE, warning = FALSE, message= FALSE)

Introduction

Academia is arguably past the tipping point of a paradigm shift towards open science. Support for this transition was first motivated by several highly publicized cases of scientific fraud [ @leveltFailingScienceFraudulent2012], increasing awareness of questionable research practices [@johnMeasuringPrevalenceQuestionable2012], and the replication crisis [@shroutPsychologyScienceKnowledge2018]. However, open science should not be seen as a cure (or punishment) for this crisis. As the late dr. Jonathan Tennant put it: "Open science is just good science" [@tennantOpenScienceJust2018]. Open science creates opportunities for researchers to more easily conduct reliable, cumulative, and collaborative science [@nosekScientificUtopiaOpening2012; @adolphOpenBehavioralScience2012]. Open science also promotes inclusivity, because it removes barriers for participation. Capitalizing on these advances has the potential to accelerate scientific progress [see also @coyneReplicationInitiativesWill2016], as has been aptly demonstrated by the role of open science in the response to the coronavirus pandemic [@sondervanCOVID19PandemicStresses2020].

Many researchers are motivated to adopt current best practices for open science and enjoy these benefits. And yet, the question of how to get started can be daunting. Making the transition requires researchers to become knowledgeable about different open science challenges and solutions, and to become proficient with new and unfamiliar tools. This paper is designed to ease that transition by presenting a simple workflow, based on established best practices,
that meets most requirements open science: The Workflow for Open Reproducible Code in Science (WORCS).

This paper introduces the conceptual workflow (referred to in upper case, "WORCS"), and discusses its underlying principles and how it meets different standards for open science. The principles underlying this workflow are universal and largely platform independent. We also present a software implementation of WORCS for R users [@rcoreteamLanguageEnvironmentStatistical2020]: The R package worcs [referred to in monospace font, @worcspackage2020]. This package offers a project template for RStudio [@rstudioteamRStudioIntegratedDevelopment2015], and several convenience functions to automate specific workflow steps. This paper is most relevant for scholars using R to conduct research that involves academic prose, analysis code, and tabular data. For other readers, it can serve as a primer on open science workflows, and provide a sensible starting point for a customized solution.

The WORCS workflow constitutes a step-by-step procedure that researchers can follow to make a research project open and reproducible. These steps are elaborated in greater detail below. See Figure \@ref(fig:workflow) for a flowchart that highlights important steps in different stages of the research process. WORCS is compatible with open science requirements already implemented by journals and institutions, and will help fulfill them. It can also be used by individual authors to produce work in accordance with best practices in the absence of top-down support or guidelines (i.e., "grass roots" open science).

knitr::include_graphics("workflow.png")
#knitr::include_graphics("workflow_graph/workflow.svg")

Defining open science practices

Although open science is advocated by many, it does not have a unitary definition. Throughout this paper, we adhere to the definition of open science as "the practice of science in such a way that others can collaborate and contribute, where research data, lab notes and other research processes are freely available, under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods" [the FOSTER Open Science Definition, as discussed in @sonjabezjakOpenScienceTraining2018]. In this context, the term "science" refers to all scholarly disciplines. We further define the term reproducible, as used in the WORCS acronym, as being able to re-perform "the same analysis with the same code using a different analyst" [@patilVisualToolDefining2019].

The "TOP-guidelines" are one of the most influential concrete operationalisations of general open science principles [@nosekPromotingOpenResearch2015a]. These guidelines describe eight standards for open science, the first seven of which "account for reproducibility of the reported results based on the originating data, and for sharing sufficient information to conduct an independent replication" [@nosekPromotingOpenResearch2015a]. These guidelines are: 1) Comprehensive citation of literature, data, materials, and methods; 2) sharing data, 3) sharing the code required to reproduce analyses, 4) sharing new research materials, and 5) sharing details of the design and analysis; 6) pre-registration of studies before data collection, and 7) pre-registration of the analysis plan prior to analysis. The eighth criterion is "not formally a transparency standard for authors", but "addresses journal guidelines for consideration of independent replications for publication". In this context, reproducibility can be defined as "re-performing the same analysis with the same code using a different analyst", and replicability as "re-performing the experiment and collecting new data" [@patilVisualToolDefining2019]. WORCS defines the goals of open science in terms of the first seven TOP-guidelines, and the workflow and its R implementation are designed to facilitate compliance therewith. We group these guidelines into three categories: citation (1), sharing (2-5), and preregistration (6-7).

Introducing the tools

WORCS relies solely on a set of free, open source software solutions which we will discuss before introducing the workflow.

Dynamic document generation

The first is dynamic document generation (DDG): Writing scientific reports in a format that interleaves written reports with blocks of code used to conduct the analyses. The text is automatically formatted as a scientific paper in several potential styles. When the text is formatted, the code blocks are evaluated and their results are insterted in the text, or rendered as figures and tables. Dynamic document generation supersedes the classical approach of using separate programs to write prose and conduct analyses, and then manually copy-pasting analysis results into the text. This paper is an example of DDG, and its source code is available here.

Although transitioning to DDG involves a slight learning curve, we strongly believe that the investment will pay off. Time saved from painstakingly copy-pasting output and manually formatting text soon outweighs the investment of switching to a new program. Moreover, human error in manually copying results is eliminated. When revisions require major changes to the analyses, all results, figures and tables are automatically updated. The flexibility in output formats also means that a manuscript can be rendered to presentation format, or even to a website or blog post. Moreover, the fact that code is evaluated each time the document is compiled requires researchers to work reproducibly, and allows reviewers and/or readers verify reproducibility simply by re-compiling the document. While writing academic papers in a programming environment might seem counter-intuitive at first, this approach is much more amenable to the needs of academics than most word processing software. It prevents mistakes, and saves time.

In the R implementation of WORCS, we recommend centering a research project around one dynamically generated RMarkdown document [@xieMarkdownDefinitiveGuide2018]. RMarkdown is based on Markdown, which uses plain-text commands to typeset a document. It enhances this format with support for R-code, and is compatible with LaTeX math typesetting. For a full overview of the functionality of RMarkdown, see @xieMarkdownDefinitiveGuide2018.

The RMarkdown document can incorporate blocks of analysis code, which can be used to render results directly to Tables and Figures, which can be dynamically referenced in the text. Long scripts can be stored in .R files, and included in the main document using the read_chunk() function from the knitr package, which comes together with the rmarkdown package. When a reader or reviewer compiles this document, all code is run automatically, thus verifying computational replicability. The RMarkdown document can be automatically formatted in many styles, including APA style [thanks to the R package papaja, @austPapajaPrepareReproducible2020], a host of other scientific formats [see @allaireRticlesArticleFormats2020], and as plain Markdown.

Version control

The second solution is version control: Maintaining an indelible log of every change to all project files. Version control is a near-essential tool for scientific reproducibility, as anyone learns who has had the experience of accidentally deleting a crucial file, or of being unable to reproduce analyses because they ran some analyses interactively, instead of documenting the syntax in a script file [see also @blischakQuickIntroductionVersion2016]. Many scientists use some form of implicit version control; for example, by renaming files after major changes (e.g., "manuscript_final_2.2-2019-10-12.doc"), tracking changes in word processing software, or using cloud hosting services that retain backups of previous versions.

An integral part of WORCS is the explicit version control software Git (www.git-scm.com). A project version controlled with Git is called a "repository", or repo. Git tracks changes to files in the repository on a line-by-line basis. The user can store these changes by making a "commit": a snapshot of the version controlled files. Each "commit" can contain as many changes as desired; for example, a whole new paragraph, or many small changes made to address a single reviewer comment. Each commit is further tagged with a message, describing the changes. The R implementation of worcs contains a convenience function, git_update("your commit message"), which creates a new commit for all changes since the previous commit, labels it with "your commit message", and pushes these changes to the remote repository. It is, effectively, a Git "quick save" command.

From an open science perspective, Git is particularly appealing because it retains a complete historical backlog of all commits. This can be used, for example, to verify that authors executed the steps outlined in a preregistration. Users can also compare changes between different commits, or go back to a previous version of the code (for example, after making a mistake, or to replicate a previous version of the results). Git only version controls files explicitly committed by the user. Moreover, it is possible to prevent files from being version controlled - which is useful for privacy sensitive data. A .gitignore file lists files that should not be version-controlled. These files thus exist only on the user's computer.

The functionality of Git is enhanced by services such as GitHub (https://github.com). GitHub is best understood as a cloud storage service with social networking functionality. The cloud storage aspect of GitHub allows you to "clone" (copy) a local Git repository to the GitHub website, as a backup or research archive. The GitHub platform offers additional functionality over Git. One such feature is that specific stages in the lifecycle of a project can be tagged as a "release": A named downloadable snapshot of a specific state of the project, that is prominently featured on the GitHub project page. The WORCS procedure encourages users to create releases to demarcate project milestones such as preregistration and submission to a journal. Releases are also useful to mark the submission of a revised manuscript, and publication.

The social network aspect of GitHub comes into play when a repository is made "public": This allows other researchers to peruse the repository and see how the work was done; clone it to their own computer to replicate the original work or apply the methods to their own data; open "Issues" to ask questions or give feedback on the project, or even send a "Pull request" with suggested changes to the text or code for your consideration. Git and GitHub shine as tools for collaboration, because different people can simultaneously work on different parts of a project, and their changes can be compared and automatically merged on the website. Even on solo projects, working with Git/GitHub has many benefits: Staying organized, being able to start a new study with a clone of an old, similar repository, or splitting off an "experimental branch" to try something new, while retaining the ability to "revert" (return) to a previous state of the project, or to "merge" (incorporate) the experimental branch.

Although GitHub is the most widely used remote repository, worcs supports alternative services. We will refer to all such services as "remote repositories". The worcs package uses gert [@oomsGertSimpleGit2019] to connect a local project to a remote repository. It also includes user-friendly functions to set user credentials, git_user(), and to add, commit, and push changed files to a remote repository in one step, git_update().

Dependency management

The third solution is dependency management: Keeping track of exactly what software was used to conduct the analyses. At first glance, it might seem sufficient to state that analyses were conducted in Program X. However, every program is susceptible to changes, updates, and bugfixes. Open Source software, in particular, is regularly updated because there is an active community of developers contributing functionality and bugfixes. Potentially, any such update could change the results of the code, thus rendering the analysis computationally non-reproducible.

Many solutions exist to ensure computational reproducibility, which differ in user-friendliness and effectiveness. These solutions typically work by enveloping your research project in a distinct "environment" that only has access to programs that are explicitly installed, and maintaining a record of these programs. When choosing an appropriate solution for dependency management, there is a tradeoff between ease-of-use on the one hand, and robustness on the other. In making this tradeoff, it is important to consider that all solutions have a limited shelf life [@brownHowLearnedStop2017]. Therefore, the best approach might be to use a "good-enough" solution, and acknowledge that all code requires some maintenance if you want to reproduce it in the future.

For dependency management, the R implementation of worcs relies on the recently released package renv [@usheyRenvProjectEnvironments2020]. This package strikes a great balance between user-friendliness and robustness. Developed by the team behind RStudio, renv maintains a text-based, human-readable log of all packages used, their version numbers, and where they were installed from (e.g., CRAN, Bioconductor, GitHub). This text-based log file can be version controlled with Git. When someone else loads your project, renv will install all of the required packages from this list. This lightweight approach safeguards reproducibility as long as these repositories maintain their archives. The worcs project template in RStudio automatically sets up renv for dependency management if the checkbox labeled "renv" is selected during project creation.

All solutions to computational reproducibility require users to make their dependencies explicit, often by reinstalling all software for each project in a controlled environment. This can lead to long installation times and large memory requirements. Similarly, renv requires users to call install.packages() for each library used - but it does not reinstall packages for each new project. Instead, packages reside in a cache that is shared between all projects on the computer. If a project uses a package that is already in the cache, it is not reinstalled, but linked from the cache. This improves the user experience by saving time and memory. As opposed to alternative solutions, renv is tightly integrated with the native R package management system. It monitors the scripts in the project directory for library() calls, and notifies the user when it is necessary to update the dependencies file using the snapshot() function. Therefore, renv does not require the users to change their workflow of installing/removing and using packages, and requires little technical skill to setup. It is not a fail-safe method to ensure strict computational reproducibility, but it meets most reproducibility requirements using only tools native to R and RStudio.

When a project is conducted in a different software environment than R, or if an R-project has external dependencies not tracked by renv, then users might choose to use Docker for dependency management.
Docker instantiates a "Docker container": An environment that behaves like a virtual computer, which can be stored like a time capsule, and identically reinstated on a user computer, or in the cloud. Although the use of Docker falls outside of the scope of the present paper, for existing users, a Docker build running worcs and its dependencies is available. The stetup with Docker vignette explains how to instantiate this build. For more information on this topic, @peikertReproducibleDataAnalysis2019 describe a Docker-based workflow for computational reproducibility that is conceptually and practically compatible with WORCS.

As Docker is somewhat challenging to set up for novice users, an R package is currently in development to facilitate Docker integration [repro, @repro]. Another example of a Docker-based workflow is the cloud-based collaboration platform 'Code Ocean'. This platform greatly simplifies setting up a reproducible computing environment with version control and dependency management. A key feature is that collaborators can log in to the server session, rather than each having to set up Docker on a local computer. One important consideration when using any cloud-computing solution is the necessity to upload human participant data for analysis. Such a platform would need to comply with relevant laws and ethical requirements. For example, as of this writing, the terms of use for Code Ocean do not address the GDPR, nor is it clear where data are stored. This is a potential liability for European scholars.

Text-based files are better

A key consideration when developing a research project is what filetypes to use. WORCS encourages the use of text-based files whenever possible, instead of binary files. Text-based files can be read by machines and humans alike. Binary files, such as Word (.docx) or SPSS (.sav, .spo) files, must be decoded first, often requiring commercial software. Although binary files can be version controlled with Git, their change log is uninterpretable. Binary files are also often larger than text-based files, which means they take up more space on cloud hosting services. For these reasons, uploading large binary files to remote repositories is frowned upon.

Two additional points are worth noting: First, as Git tracks line-by-line changes, it is recommended to start each sentence on a new line. That way, the change log will indicate which specific sentence was edited, instead of replacing an entire paragraph. As RMarkdown does not parse a single line break as the end of a paragraph, this practice does not disrupt the flow of a paragraph in the rendered version. Second, the remote repository GitHub renders certain text-based filetypes for online viewing: For example, .md (Markdown) files are displayed as web pages, and .csv files as spreadsheets. The worcs package uses this feature to, for example, render a codebook for the data as .md so it will be shown as a web page on GitHub.

Introducing the workflow

We provide a conceptual outline of the WORCS procedure below, based on WORCS Version 0.1.6. The conceptual workflow recommends actions to take in the different phases of a research project, in order to work openly and transparently in compliance with the TOP-guidelines and best practices for open science. Note that, although the steps are numbered for reference purposes, we acknowledge that the process of conducting research is not always linear. We refer users of the worcs implementation in R to the workflow vignette, which describes in more detail which steps are automated by the R package and which actions must be taken by the user. The vignette is also updated with future developments of the package. For examples of how the workflow is used in practice, we maintain a list of public worcs projects on the WORCS GitHub page.

get_workflow <- function(filename = "../vignettes/workflow.Rmd"){
  txt <- readLines(filename)
  txt <- txt[grepl("<!--S:", txt, fixed = T)|grepl("^###", txt)]
  txt <- gsub("(<!--S: |-->.*$)", "", txt)
  txt <- gsub("^#", "", txt)
  out <- vector("character")
  for(i in seq_along(txt)){
    if(startsWith(txt[i], "#")){
      out <- append(out, c("", txt[i], ""))
    } else {
      out <- append(out, txt[i])
    }
  }
  cat(out, sep = "\n")
}
get_workflow()

The R implementation of WORCS

WORCS is, first and foremost, a conceptual workflow that could be implemented in any software environment. As of this writing, the workflow has been implemented for R users in the package worcs [@worcspackage2020]. Several arguments support our choice to implement this workflow first in R and RStudio, although we encourage developers to port the software to other languages. First, R and all of it extensions are free open source software, which make it a tool of choice for open science. Second, all tools required for an open science workflow are implemented in R, most of these tools are directly accessible through the RStudio user interface, and some of them are actively developed by the team behind RStudio. Third, RStudio is a Public Benefit Corporation, whose commitment to open science is codified in the company charter:

RStudio’s primary purpose is to create free and open-source software for data science, scientific research, and technical communication. This allows anyone with access to a computer to participate freely in a global economy that rewards data literacy; enhances the production and consumption of knowledge; and facilitates collaboration and reproducible research in science, education and industry.

Fourth, R is the second-most cited statistical software package [@muenchenPopularityDataScience2012], following SPSS, which has no support for any of the open science tools discussed in this paper. Fifth, R is well-supported by a vibrant and inclusive online community, which develops new methods and packages, and provides support and tutorials for existing ones. Finally, R is highly interoperable: Packages are available to read and write nearly every filetype, including DOCX, PDF (in APA style), and HTML. Moreover, wrappers are available for many tools developed in other programming languages, and code written in other programming languages can be evaluated from R, including C++, Fortran, and Python [@allaireMarkdownPythonEngine2020]. There are excellent free resources for learning R [e.g., @grolemundDataScience2017].

Working with R is simplified immensely by using the RStudio integrated development engine (IDE) for R [@rstudioteamRStudioIntegratedDevelopment2015]. RStudio automates and streamlines tedious or complicated aspects of working with the tools used in WORCS, and many are embedded directly into the visual user interface. This makes RStudio a comprehensive solution for open science research projects. Another important feature of RStudio is project management. A project bundles writing, analyses, data, references, et cetera, into a self-contained folder, that can be uploaded entirely to a Git remote repository and downloaded by future users. The worcs R package installs a new RStudio project template. When a new project is initialized from this template, the bookkeeping required to set up an open science project is performed automatically.

Preparing your system

Before you can use the R implementation of the WORCS workflow, you have to install the required software. This 30 minute procedure is documented in the setup vignette. After setting up your system, you can use worcs for all of your projects. Alternatively, Docker users can follow the stetup with Docker vignette. The R implementation of the workflow introduced earlier is detailed, step-by-step, in the workflow vignette.

How WORCS helps meet the TOP-guidelines

Comprehensive citation

The TOP-guidelines encourage comprehensive citation of literature, data, materials, methods, and software. In principle, researchers can meet this requirement by simply citing every reference used. Unfortunately, citation of data and software is less commonplace than citation of literature and materials. Crediting these resources is important, because it incentivizes data sharing and the development of open-source software, supports the open science efforts of others, and helps researchers receive credit for all of their research output, in line with the San Francisco Declaration on Research Assessment, DORA.

To facilitate citing datasets, researchers sometimes publish data papers; documents that detail the procedure, sample, and codebook. Specialized journals, such as the Journal of Open Psychology Data, aid in the publication of these data papers. For smaller projects, researchers often simply share the data in a remote repository, along with a text file with the preferred citation for the data (which can be a substantive paper), and the license that applies to the data, such as Creative Commons BY-SA or BY-NC-SA. When in doubt, one can always contact the data creators and ask what the preferred citation is.

References for software are sometimes provided within the software environment; for example, in R, package references can be found by calling citation("packagename"). This returns an APA-style reference, and a BibTeX entry. Software papers are also common, and are sometimes called "Application Notes". The Journal of Open Source Software (JOSS) offers programmers a way to generate a publication based on well-documented software; in this process, the code itself is peer reviewed, and writing an actual paper is optional. For an example, see @rosenbergTidyLPAPackageEasily2018. Software citations can also sometimes be found in the source code repository (e.g., on Git remote repositories), using the 'Citation File Format'. The Software Heritage Project recently published a BibLaTeX software citation style that will further facilitate accurate citation of software in dynamically generated documents.

One important impediment to comprehensive citation is the fact that print journals operate with space constraints. Print journals often discourage comprehensive citation, either actively, or passively by including the reference list in the manuscript word count. Researchers can overcome this impediment by preparing two versions of the manuscript: One version with comprehensive citations for online dissemination, and another version for print, with only the essential citations. The print version should reference the online version, so interested readers can find the comprehensive reference list. The WORCS procedure suggests uploading the online version to a preprint server. This is important because most major preprint servers - including arXiv.org and all preprint services hosted by the Open Science Framework (OSF) - are indexed by Google Scholar. This means that authors will receive credit for cited work; even if they are cited only in the online version. Moreover, preprint servers ensure that the online version will have a persistent DOI, and will remain reliably accessible, just like the print version.

Implementation in worcs

The worcs package includes a vignette on citation to explain the fundamentals for novice users, and offer recommendations specific to worcs. We recommend using the free, open-source reference manager Zotero; it is feature-rich, user-friendly, and highly interoperable with other reference managers. A tutorial for using Zotero with RMarkdown exists here.

The worcs package also offers original solutions for comprehensive citation. Firstly, during the writing process, authors can mark the distinction between essential and non-essential references. It is easier to do this from the start, instead of going back to cut non-essential references just prior to publication. Standard Markdown uses the at-symbol (@) to cite a reference. worcs additionally reserves the "double at"-symbol (@) to cite a non-essential reference. Users can render the manuscript either with, or without, comprehensive citations by adapting the knit command in the front matter, setting it to knit: worcs::cite_all to render all citations, and to knit: worcs::cite_essential to remove all non-essential citations.

With regard to the citation of R packages, it is worth noting that the papaja package [@austPapajaPrepareReproducible2020], which provides the APA6 template used in worcs, also includes functions to automatically cite all packages used in the R session, and add their references to the bibliography file.

Data sharing

Data sharing is important for computational reproducibility and secondary analysis. Computational reproducibility means that a third party can exactly recreate the results from the original data, using the published analysis code. Secondary analysis means that a third party can conduct sensitivity analyses, explore alternative explanations, or even use existing data to answer a different research question. If it is possible to share the data, these can simply be pushed to a Git remote repository, along with documentation (such as a codebook) and the analysis code. This way, others can download the entire repository and reproduce the analyses from start to finish. From an open science perspective, data sharing is always desirable. From a practical point of view, it is not always possible.

Data sharing may be impeded by several concerns. One technical concern is whether data require special storage provisions, as is often the case with "big data". Most Git remote repositories do not support files larger than 100MB, but it is possible to enhance a Git repository with Git Large File Storage to add remotely stored files as large as several GB. Files exceeding this size require custom solutions beyond the scope of this paper.

Sharing human participant data additionally entails legal and ethical concerns. Before sharing any human participant data, it is recommended to obtain approval from qualified ethical and legal advisory organs, guidance from Research Data Management Support, and informed consent from participants. Many legislatures require researchers to 'de-identify' human subject data upon collection, by removing or deleting any sensitive personal information and contact details. However, when researchers have an imperfect understanding of the legal obligations and dispensations, they can end up over- or undersharing, and risk either placing participants at risk of being identified, or undermine the replicability of their own work and potential reuse value of their data [@phillipsDiscombobulationDeidentification2016]. Furthermore, it is important to note that the European GDPR prohibits storing 'personal data' (information which can identify a natural person whether directly or indirectly) on a server outside the EU, unless it offers an "adequate level of protection". Although different rules may apply to pseudonimized data, there are many repositories that are GDPR compliant, such as the European servers of the Open Science Framework. Different Universities, countries, and funding bodies also have their own repositories that are complient with local legistation. In sum, data should only be shared after consult qualified legal and ethical advisory organs, obtaining informed consent from participants, and de-identifying or pseudononimizing data.

If data cannot be shared, researchers should aim to safeguard the potential for computational reproducibility and secondary analysis as much as possible. WORCS recommends two solutions to accomplish this goal. The first solution is to publish a checksum of the original data file [@rivest1992md5]. Think of a checksum as a 32-character unique identifier^[It is theoretically possible but improbable that random changes to a file will result in the same checksum.], or as a lossy "summary" of the contents of a file. Any change to the file will result in a different checksum. Thus, one can use a checksum to verify the identity of a file, and ensure that its contents are unchanged. When data cannot be shared, the risk of fraud or abuse can be mitigated by publishing a checksum for the original data, in addition to the complete analysis code. Using the checksum to verify the identity of a private dataset, researchers can prove to an independent auditor that running the public analysis code on the private data results in the published results of their work.

A second solution is to share a synthetic dataset with similar characteristics to the real data. Synthetic data mimic the level of measurement and (conditional) distributions of the real data [see @nowokSynthpopBespokeCreation2016]. Sharing synthetic data allows any third party to 1) verify that the published code works, 2) debug the code, and 3) write valid code for alternative analyses. It is important to note that complex multivariate relationships present in the real data are often lost in synthetic data. Thus, findings from the real data might not be replicated in the synthetic data, and findings in the synthetic data should not be substantively interpreted. Still, sharing synthetic data facilitates data requests from third parties. A third party can write analysis code based on the synthetic data, and send it to the authors who evaluate it on the real data and send back the results. The worcs package offers a simple but flexible function to generate synthetic data, synthetic(). This function is based on the work by @nowokSynthpopBespokeCreation2016 (see the R package synthpop), but with a simpler and more flexible user interface. Specifically, the worcs function synthetic() accepts any function that can be used to model marginal associations in the data. By default, this function uses random forests, as implemented in the ranger package, to generate a model for synthesis [@wrightRangerFastImplementation2015]. Note that the ranger() function does not handle all available data types in R. Users can call custom functions to build a prediction model for unsupported data types, using the argument model_expression. For further details, see the function documentation.

Note that for data synthesis to work properly, it is essential that the class (data type) of variables is defined correctly. As of worcs version r worcs:::cran_version(), numeric, integer, factor, and logical data are supported out of the box. Other types of variables should be converted to one of these types. It is important to consider this when using special data types, such as the labeled vectors created by the haven package when reading an SPSS file. We further caution that the default random forests algorithm is computationally intensive, and may be unfeasible for use with "large" datasets. Users can provide use a custom model_expression and predict_expression to use a different algorithm when calling synthetic(). For example, users could use lm() as a model_expression to use linear regression, which preserves linear marginal relationships but can give rise to values out of range of the original data. Or users could call sample() as a predict_expression to bootstrap each variable, a very quick solution that maintains univariate distributions but loses all marginal relationships.

Processing data in WORCS

When initializing a new worcs project, an empty R script called prepare_data.R is generated. As soon as raw data are collected, researchers should use this file to document all steps necessary to load the data into R, pseudonimize it, and prepare it for analysis. As data should be shared in a format as close to raw as possible, this script should be as short as possible and only document the minimum necessary steps. For example, if the data was originally in SPSS format with IP addresses and GPS location, this file might just contain the code required to read the SPSS file, and to remove those columns of sensitive personal information. As soon as the data are pseudonimized and processed into tabular (spreadsheet-like) format, the researcher should version control some indelible record of the raw data. The concept of 'tidy' data is a helpful standard; i.e., clean, tabular data that is formatted in a way that facilitates further analysis [see @wickhamTidyData2014].

One issue of concern is the practice of "fixing" mistakes in the data. It is not uncommon for researchers to perform such corrections manually in the raw data, even if all other analysis steps are documented. Regardless of whether the data will be open or closed, it is important that the raw data be left unchanged as much as possible. This eliminates human error. Any alterations to the data - including processing steps and even error corrections - should be documented in the code, instead of applied to the raw data. This way, the code will be a pipeline from the raw data to the published results. Thus, for instance, if data have been entered manually in a spreadsheet, and a researcher discovers that in one of the variables, missing values were coded as -99 whereas they were coded as -9 in all other variables, then we would recommend adding a line of code to recode the missing values - thereby documenting the mistake - instead of "fixing" it by editing the raw spreadsheet.

The worcs package offers two functions for version controlling a record of the data: One for open, and one for closed data. Researchers should assume that the decision to make data open is irreversible. Thus, the decision should be made well before arriving at this stage in the workflow. If there is any doubt, it is prudent to proceed with closed data and make the data open once all doubt is assuaged.

In worcs, researchers can call the function open_data(data) to make the data publicly available. This function stores the data object (such as a data.frame or matrix) in a tabular data file. It also generates an accompanying codebook - a human-readable document that serves as a legend for the tabular data file - that can be viewed online. All changed files should now be added to the Git repository, committed, and pushed to the remote repository. This can be done through the Git panel in RStudio, or by running the worcs function git_update("commit message"). Once the remote repository is made public, these data will be open to the public. Assume that, once this is done, the data cannot be un-shared.

Alternatively, if the project requires data to remain closed, researchers can call closed_data(data). This function also stores the data in a local .csv file and generates a codebook, but the original data file is added to .gitignore so it cannot be accidentally added to the Git repository. The function also computes a checksum for the original data, and logs it in the .worcs project file. This checksum serves as the indelible record of the state of the raw data. The function closed_data() also calls the aforementioned synthetic() function to create a synthetic dataset. The user should push all changed files to the remote repository. Once the remote repository is made public, people will have access to a checksum for the original data that exist on your local machine, a codebook describing those data, and a synthetic copy of the data. Keep in mind that, when using closed data, the original data file exists only on the user's computer. Adequate provisions to back up the data, while respecting principles of responsible data stewardship, should be made.

As the purpose of the prepare_data.R file is to prepare data for sharing, the final line of this file should typically be either open_data() or closed_data().

After generating tidy, shareable data, this data is then loaded in the analysis code with the function load_data(). This function only loads the real data if it is available on the user's computer, and otherwise, loads the synthetic data. This will have the effect that third party users who copy a remote repository with closed data will automatically load the synthetic dataset, whereas the study authors, who have the real data stored locally, will automatically load the real data. This makes it possible for reviewers, coauthors, and auditors, to write analysis scripts without requiring access to the original data. They can simply start the script with load_data(), and write their code based on the synthetic data. Then, they can submit their code to you - by email or as a "Pull request". When you run their code on your system, load_data() will load the original data, and use that to run their script. You can then return the results to the third party.

Note that these functions can be used to store multiple data files, if necessary. As of version 0.1.2, the worcs package only stores data as .csv files. As explained above, this is recommended, because .csv files are human- and machine readable, and because data typically need to be in tabular format for analysis. Many types of data can be represented as a table; for example, text corpora can have one row per document, and EEG- or ECG waveforms can have a row per measurement occasion and a column per channel. The prepare_data.R file should document any steps required to convert these data into tabular format. If this is not an appropriate format for the data, readers are encouraged to follow the development of the repro package [@repro], which will offer greater support for reproducibly storing and retrieving diverse data formats.

Some users may intend to make their data openly available through a restricted access platform, such as an institutional repository, but not publicly through a Git remote repository. In this case, it is recommended to use closed_data(), and to manually upload the original data.csv file to the restricted access platform. If users wish to share their data through the Open Science Framework, it is sufficient to connect the OSF page to the Git remote repository as an Add-on.

Sharing code, research materials, design and analysis

When writing a manuscript created using dynamic document generation, analysis code is embedded in the prose of the paper. Thus, the TOP-guideline of sharing analysis code can be met simply by committing the source code of the manuscript to the Git repository, and making this remote repository Public. If authors additionally use open data and a reproducible environment (as suggested in WORCS), then a third party can simply replicate all analyses by copying the entire repository from the Git hosting service, and Knitting the manuscript on their local computer.

Aside from analysis code, the TOP-guidelines also encourage sharing new research materials, and details of the study design and analysis. These goals can be accomplished by placing any such documents in the Git repository folder, committing them, and pushing to a cloud hosting service. As with any document version controlled in this way, it is advisable (but not required) to use plain text only.

Preregistration

Lindsay and colleagues [-@lindsayResearchPreregistration1012016] define preregistration as "creating a permanent record of your study plans before you look at the data. The plan is stored in a date-stamped, uneditable file in a secure online archive." Two such archives are well-known in the social sciences: AsPredicted.org, and OSF.io. However, Git cloud hosting services also conform to these standards. Thus, it is possible to preregister a study simply by committing a preregistration document to the local Git repository, and pushing it to the remote repository. Subsequently taging the release as "Preregistration" on the remote repository renders it distinct from all other commits, and easily findable by people and programs. This approach is simple, quick, and robust. Moreover, it is compatible with formal preregistration through services such as AsPredicted.org or OSF.io: If the preregistration is written using dynamic document generation, as recommended in WORCS, this file can be rendered to PDF and uploaded as an attachment to the formal preregistration service.

The advantages, disadvantages, and pitfalls for preregistering different types of studies have been extensively debated elsewhere [see @lindsayResearchPreregistration1012016]. For example, because the practice of preregistration has historically been closely tied to experimental research, it has been a matter of some debate whether secondary data analyses can be preregistered [but see @westonRecommendationsIncreasingTransparency2019 for an excellent discussion of the topic].

When analyzing existing data, it is difficult to prove that a researcher did not have direct (or indirect, through collaborators or by reading studies using the same data) exposure to the data, before composing the preregistration. However, the question of proof is only relevant from a perspective of preventing scientific misconduct. Preregistration is a good way to ensure that a researcher is not "HARKing": Hypothesizing after the results are known [@kerrHARKingHypothesizingResults1998].

Good faith preregistration efforts always improve the quality of deductive (theory-testing) research, because they avoid HARKing, ensure reliable significance tests, avoid overfitting noise in the data, and limit the number of forking paths researchers wander during data analysis [@gelmanStatisticalCrisisScience2014].

WORCS takes the pragmatic position that, in deductive (hypothesis-testing) research, it is beneficial to plan projects before executing them, to preregister these plans, and adhere to them. All deviations from this procedure should be disclosed. Researchers should minimize exposure to the data, and disclose any prior exposure, whether direct (e.g., by computing summary statistics) or indirect (e.g., by reading papers using the same data). Similarly, one can disclose any deviations from the analysis plan to handle unforeseen contingencies, such as violations of model assumptions; or additional exploratory analyses.

WORCS recommends documenting a verbal, conceptual description of the study plans in a text-based file. Optionally, an analysis script can be preregistered to document the planned analyses. If any changes must be made to the analysis code after obtaining the data, one can refer to the conceptual description to justify the changes. The ideal preregistered analysis script consists of a complete analysis that can be evaluated once the data are obtained. This ideal is often unattainable, because the data present researchers with unanticipated challenges; e.g., assumptions are violated, or analyses work differently than expected. Some of these challenges can be avoided by simulating the data one expects to obtain, and writing the analysis syntax based on the simulated data. This topic is beyond the scope of the present paper, but many user-friendly methods for simulating data are available in most statistical programming languages. For instance, R users can use the package simstudy [@goldfeldSimstudySimulationStudy2020].

As soon as a project is preregistered on a Public Git repository, it is visible to the world, and reviewers (both formal reviewers designated by a journal, and informal reviewers recruited by other means) can submit comments, e.g., through "Pull requests" on a Git remote repository. If the remote repository is Private, Reviewers can be invited as "Collaborators". It is important to note that contributing to the repository in any way will void reviewers' anonymity, as their contributions will be linked to a user name. Private remote repositories can be made public at a later date, along with their entire time-stamped history and tagged releases.

Implementation of preregistration in worcs

The worcs workflow facilitates preregistration, primarily by explicitly inviting a user to write the preregistration document. To this end, worcs imports several preregistration templates from the R package prereg [@austPreregMarkdownTemplates2019], including templates from organizations like AsPredicted.org and OSF.io, and from researchers [e.g., @vantveerPreregistrationSocialPsychology2016]. When initializing an RStudio project with the worcs project template, one of these preregistration templates can be selected, which will generate a file called preregistration.Rmd. This file should be used to document study plans, ideally prior to data collection. Within the workflow, the Git cloud hosting services can be used to preregister a study simply by pushing a text document with the study plans to a Git remote repository. By tagging the commit as a preregistration (see the workflow vignette), it is distinct from all other commits, and easily findable by people and machines.

Compatibility with other standards for open science

Aside from the TOP Guidelines, the FAIR Guiding Principles for scientific data management and stewardship [@wilkinsonFAIRGuidingPrinciples2016a] are an important standard for open science. These principles advocate that digital research objects should be Findable, Accessible, Interoperable and Reusable. Initially mainly promoted as principles for data, they are increasingly applied as a standard for other types of research output as well, most notably software [@lamprechtFAIRPrinciplesResearch2019].

WORCS was not designed to comprehensively address these principles, but the workflow does facilitate meeting them. In terms of Findability, all files in GitHub projects are searchable by content and meta-data, both on the platform and through standard search engines. The workflow further recommends that users create a project page on the Open Science Framework – a recognized repository in the social sciences. When doing so, it is important to generate a Digital Object Identifier (DOI) for this project. A DOI is a persistent way to identify and connect to an object on the internet, which is crucial for findability. Additional DOIs can be generated through the OSF for specific resources, such as data sets. It is further worth noting that, optionally, GitHub repositories can be connected to Zenodo: On Zenodo, users can store a snapshot of the repository, provide metadata, and generate a DOI for the project or specific resources. When a project is connected to the OSF, as recommended, these steps may be redundant. Finally, each worcs project has a [YAML file] (https://yaml.org/) that lists the data objects (and other meta-data) in the project, which will allow web crawlers to index worcs projects. We use this metadata, for example, to identify public worcs projects on GitHub.

Accessibility is primarily safeguarded by hosting all project files in a public GitHub repository, which can be downloaded as a ZIP archive, and cloned or forked using the Git protocol. GitHub is committed to long term accessibility of these resources. We further encourage users to make their data accessible when possible. When data must be closed, the worcs function closed_data() adds a paragraph to the project readme, indicating how readers can contact the author for access.

With regard to interoperability, the workflow recommends using plain-text files whenever possible, which ensures that code and meta-data are human- and machine-readable. Furthermore, the worcs package is implemented in R, which is open source. To promote reusability, we recommend users to specify an appropriate license for their projects. The worcs package includes the Creative Commons (CC) licenses, and suggests a CC-BY 4.0 license by default. This license is featured in the GitHub repository, and we encourage users to also display it on their project’s OSF page. Note that it is possible to license specific project resources separately under different licenses. For instance, whereas the CC-licenses are suitable for data, media, and academic papers - they are ill-suited for licensing software. It is thus prudent to consider separately licensing specific project resources, particularly when a project contains original source code (i.e., you write your own functions). For a more extensive discussion of appropriate licensing, see @peikertReproducibleDataAnalysis2019, @stoddenEnhancingReproducibilityComputational2016, and the website choosealicense.com.

To further facilitate reusability, worcs automatically generates a codebook for each data file. We encourage users to elaborate on the default codebook by adding variable description and categories, which would - in the future - enable indexing data by topic area.

Incidentally, the FAIR principles have been applied to software as well, and worcs follows these recommendations for FAIR research software. It is hosted on a version controlled public remote repository GitHub, has an open source licence (GPL v3.0), is registered in a community registry (CRAN), enables the citation of the software (using the citation("worcs") command, or by citing this paper), and followed the CII Best Practices software quality checklist during development.

Sharing all research objects, as advocated in WORCS, also provides a thorough basis for research evaluation according to the San Francisco Declaration on Research Assessment (DORA; https://sfdora.org/), which plays an increasing role in grant funding, hiring, and promotion procedures. Direct access to research objects allows stakeholders to evaluate research quality based on content rather than relying on spurious surrogate indicators like journal impact factors, conference rankings, and h-indexes. The detailed version control and commit tracking of Git remote repositories furthermore make it possible to assess the relative contributions made by different researchers.

Discussion

In this tutorial paper, we have presented a workflow for open reproducible code in science. The workflow aims to lower the threshold for grass-roots adoption of open science principles. The workflow is supported by an R package with an RStudio project template and convenience functions. This relatively light-weight workflow meets most of the requirements for open science as detailed in the TOP-guidelines, and is compatible with other open science guidelines. The workflow helps researchers meet existing requirements for open science, and to reap the benefits of working openly and reproducibly, even when such requirements are absent.

Comparing WORCS to existing solutions

There have been several previous efforts to promote grass-roots adoption of open science principles. Each of these efforts has a different scope, strengths, and limitations that set it apart from WORCS. For example, there are "signalling solutions"; guidelines to structure and incentivize disclosure about open science practices. Specifically, @aalbersbergMakingScienceTransparent2018 suggested publishing a "TOP-statement" as supplemental material, which discloses the authors' adherence to open science principles. Relatedly, @aczelConsensusbasedTransparencyChecklist2019 developed a consensus-based Transparency Checklist that authors can complete online to generate a report. Such signalling solutions are very easy to adopt, and they address TOP-guidelines 1-7. Many journals now also offer authors the opportunity to earn "badges" for adhering to open science guidelines [@kidwellBadgesAcknowledgeOpen2016]. These signalling solutions help structure authors' disclosures about, and incentivize adherence to, open science practices. WORCS, too, has its own checklist that calls attention to a number of concrete items contributing to an open and reproducible research project. Users of the worcs package can receive a badge, displayed on the README.md file, based on a semi-automated scoring of the checklist items, by calling the function check_worcs() within a worcs project directory.

A different class of solutions instead focuses on the practical issue of how researchers can meet the requirements of open science. A notable example is the workflow for reproducible analyses developed by @peikertReproducibleDataAnalysis2019, which uses Docker to ensure strict computational reproducibility for even the most sophisticated analyses. There are some decisive differences with WORCS. First, concerning the scope of the workflow: WORCS is designed to address a unique issue not covered by other existing solutions, namely, to provide a workflow most conducive to satisfying TOP-guidelines 1-7 while adhering to the FAIR principles. Peikert and Brandmaier instead focus primarily on computational reproducibility, which is relevant for TOP-guidelines 2, 3, and 5. Second, with regard to ease of use, WORCS aims to bring down the learning curve by adopting sensible defaults for any decisions to be made. Peikert and Brandmaier, by contrast, designed their workflow to be very flexible and comprehensive, thus requiring substantially more background knowledge and technical sophistication from users. To sum up, WORCS builds upon the same general principles as Peikert and Brandmaier, and the two workflows are compatible. What sets WORCS apart is that it is more lightweight in terms of system and user requirements, thereby facilitating adoption, but still ensures computational reproducibility under most circumstances.

One initiative that WORCS is fully compatible with, is the Scientific Paper of the Future. This organization encourages geoscientists to document data provenance and availability in public repositories, document software used, and document the analysis steps taken to derive the results. All of these goals could be met using WORCS.

Limitations

WORCS is intended to substantially reduce the threshold for adopting best practices in open and reproducible research. However, several limitations and issues for future development remain. One potential challenge is the learning curve associated with the tools outlined in this paper. Learning to work with R, RMarkdown, and Git takes time - although tutorials such as this one substantially reduce the learning curve. Moreover, the time investment tends to pay off. Working with R opens the door to many cutting edge analysis techniques. Working with RMarkdown saves time and prevents mistakes by avoiding tedious copying of results into a text document. Working with Git keeps projects organized, prevents accidental loss of work, enables integrating changes by collaborators in a non-destructive manner, and ensures that entire research projects are archived and can be accessed or copied by third parties. Thus, the time investment is eminently worthwhile.

Another limitation is the fact that no single workflow can suit all research projects. Not all projects conform to the steps outlined here; some may skip steps, add steps, or follow similar steps in a different order. In principle, WORCS does not enforce a linear order, and allows extentensive user flexibility. A related concern is that some projects might require specialized solutions outside of the scope of this paper. For example, some projects might use very large data files, rely on data that reside on a protected server, or use proprietary software dependencies that cannot be version controlled using renv or Docker. In such cases, the workflow can serve as a starting point for a custom approach; offering a primer on best practices, and a discussion of relevant factors to consider. We urge users to inform us of such challenges and custom solutions, so that we may add them to the overview of WORCS-projects on GitHub, where they can serve as an example to others.

Another important challenge is managing collaborations when only the lead author uses RMarkdown, and the coauthors use Word. When using the papaja package to write APA-style papers, it is possible to Knit the manuscript to Word (.docx), by changing the line output: papaja::apa6_pdf in the manuscript's YAML header to output: papaja::apa6_docx. There are some limitations to the conversion, discussed here. When soliciting feedback from co-authors, ask them to use Track Changes and comment bubbles in Word. Then, manually incorporate their changes into the manuscript.Rmd file. In most cases, this is the most user-friendly approach, as most lead authors would review changes by co-authors anyway. A second approach is to ask collaborators to work in plain text. In this case, send collaborators the manuscript.Rmd file, and ask them to open (and save) it in Word or Notepad as a plain text file. When they send it back, make sure any changes to your local file are committed, and then simply overwrite your local version with their file. In RStudio, select the file in the Git tab, and click the Diff button to examine what changes the collaborator has made relative to your last committed version.

If all collaborators are committed to using WORCS, they can Fork the repository from the lead author on GitHub, clone it to their local device, make their own changes, and send a pull request to incorporate their changes. Working this way is extremely conducive to scientific collaboration [@ramGitCanFacilitate2013]. Recall that, when using Git for collaborative writing, it is recommended to insert a line break after every sentence so that the change log will indicate which specific sentence was edited, and to prevent "merge conflicts" when two authors edit the same line. The resulting document will be rendered to PDF without spurious line breaks.

Being familiar with Git remote repositories opens doors to new forms of collaboration: In the open source software community, continuous peer review and voluntary collaborative acts by strangers who are interested in a project are commonplace [see @adolphOpenBehavioralScience2012]. This kind of collaboration is budding in scientific software development as well; for example, the lead author of this paper became a co-author on several R packages after submitting pull requests with bug fixes or additional functionality [@rosenbergTidyLPAPackageEasily2018; @hallquistMplusAutomationPackageFacilitating2018], and two co-authors of this paper became involved by contributing pull requests to worcs. It is also possible to invite such collaboration by opening Issues for tasks that still need to be accomplished, and tag known collaborators to address them, or invite external collaborators to contribute their expertise.

A final limitation is that, as WORCS is relatively new, the number of current users is still limited. We therefore do not have elaborate insights in user experiences or feedback. We strive to make the uptake of worcs as easy as possible, by means of this background paper, extensive documentation and vignettes for the R implementation, online video tutorials, webinars and exemplar use cases. We encourage our users to provide us with feedback, feature requests, or bug reports through the channels indicated here.

Future developments

WORCS provides a user-friendly and lightweight workflow for open, reproducible research, that meets all TOP-guidelines pertaining to reproducibility (1-7). Nevertheless, there are clear directions for future developments. Firstly, although the workflow is currently implemented only in R, it is conceptually relevant for users of other statistical programming languages, and we welcome efforts to implement WORCS in other platforms. We welcome efforts to implement WORCS in other platforms. Secondly, even when a project is in R, it may have dependencies outside of the R environment that cannot be managed using renv. It is beyond the scope of the worcs package to support tools outside of R, or to containerize a project so that it can be identically reinstated on a different system or virtual machine. The forthcoming package repro [@repro] will offer such extensions of the workflow, thus combining the strengths of WORCS with the solutions proposed by @peikertReproducibleDataAnalysis2019.

Conclusion

WORCS offers a workflow for open reproducible code in science. The step-by-step procedure outlined in this tutorial helps researchers make an entire research project Open and Reproducible. The accompanying R package provides user-friendly support functions for several steps in the workflow, and an RStudio project template to get the project set up correctly.

WORCS encourages and simplifies the adoption of open science principles in daily scientific work. It helps researchers make all research output created throughout the scientific process - not just manuscripts, but data, code, and methods - open and publicly assessible. This enables other researchers to reproduce results, and facilitates cumulative science by allowing others to make direct use of these research objects.

References

cjvanlissa/worcs documentation built on Nov. 7, 2024, 1:20 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

cjvanlissa/worcs
Workflow for Open Reproducible Code in Science

In cjvanlissa/worcs: Workflow for Open Reproducible Code in Science

Introduction

Defining open science practices

Introducing the tools

Dynamic document generation

Version control

Dependency management

Text-based files are better

Introducing the workflow

The R implementation of WORCS

Preparing your system

How WORCS helps meet the TOP-guidelines

Comprehensive citation

Implementation in worcs

Data sharing

Processing data in WORCS

Sharing code, research materials, design and analysis

Preregistration

Implementation of preregistration in worcs

Compatibility with other standards for open science

Discussion

Comparing WORCS to existing solutions

Limitations

Future developments

Conclusion

References

R Package Documentation

Browse R Packages

We want your feedback!

cjvanlissa/worcs Workflow for Open Reproducible Code in Science

In cjvanlissa/worcs: Workflow for Open Reproducible Code in Science

Introduction

Defining open science practices

Introducing the tools

Dynamic document generation

Version control

Dependency management

Text-based files are better

Introducing the workflow

The R implementation of WORCS

Preparing your system

How WORCS helps meet the TOP-guidelines

Comprehensive citation

Implementation in worcs

Data sharing

Processing data in WORCS

Sharing code, research materials, design and analysis

Preregistration

Implementation of preregistration in worcs

Compatibility with other standards for open science

Discussion

Comparing WORCS to existing solutions

Limitations

Future developments

Conclusion

References

R Package Documentation

Browse R Packages

We want your feedback!

cjvanlissa/worcs
Workflow for Open Reproducible Code in Science