library("papaja") mask <- TRUE masked <- function(text){ ifelse(mask, "[masked]", text) }
# Seed for random number generation set.seed(42)
\setlength{\parskip}{0em}
Academia is arguably past the tipping point of a paradigm shift towards open science. Several immediate causes rallied support for this transition, including highly publicized cases of scientific fraud [as discussed by @leveltFailingScienceFraudulent2012], increasing awareness of questionable research practices and their consequences for the credibility of findings [@johnMeasuringPrevalenceQuestionable2012], and the replication crisis [@shroutPsychologyScienceKnowledge2018]. However, open science should not be seen as a cure (or punishment) for this crisis. As the late Dr. Jonathan Tennant put it: "Open science is just good science" [@tennantOpenScienceJust2018]. Open science creates opportunities for researchers to more easily conduct reliable, cumulative, and collaborative science [@nosekScientificUtopiaOpening2012; @adolphOpenBehavioralScience2012]. Open science also promotes inclusivity, because it removes barriers to participation. Capitalizing on these advances has the potential to accelerate scientific progress [see also @coyneReplicationInitiativesWill2016].
Many researchers are motivated to adopt current best practices for open
science and enjoy these benefits. And yet, the question of "how" to
apply open science practices can be daunting. Making the transition
requires researchers to become knowledgeable about different open science
challenges and solutions, and to become proficient with new and unfamiliar tools.
This paper is designed to ease that transition by presenting a simple workflow
that meets most requirements of open science, and is based on best practices:
The Workflow for Open Reproducible Code in Science (WORCS).
WORCS is a lightweight approach to open science and
computational reproducibility for scholars working in R on a private computer,
conducting research that involves academic prose, analysis code,
and (optionally) tabular data - but it is flexible enough to accommodate many
other scenarios. The workflow (Figure \@ref(fig:workflow))
constitutes a step-by-step procedure that researchers can follow to make a
research project (including data, code, and a report) open and reproducible.
WORCS is compatible with open science
requirements already implemented by journals and institutions, and will help
fulfill them. It can also be used in the absence of top-down support or guidelines
to produce work in accordance with best practices.
Although the principles underlying this workflow are universal, WORCS has first
been implemented for R users [@rcoreteamLanguageEnvironmentStatistical2020].
The R package worcs
[@worcspackage2020] offers a project template for RStudio
[@rstudioteamRStudioIntegratedDevelopment2015], which is, effectively,
a "one click solution". This paper introduces the WORCS workflow, discusses how
it meets best practices for open science, and illustrates the use of the worcs
package.
```{r workflow}
knitr::include_graphics("workflow_graph/workflow.png")
```
Although Open Science is advocated by many, it does not have a unitary definition [for an effort to ground open science in first principles, see @tennantValuePropositionOpen2020]. Instead, several guidelines exist for practices that are considered "open". The TOP-guidelines are one of the most influential operationalisations of Open Science principles [@nosekPromotingOpenResearch2015a]. These guidelines describe eight standards for open science: 1) comprehensive citation of literature, data, materials, and methods; 2) sharing data; 3) sharing the code required to reproduce analyses; 4) sharing new research materials; 5) sharing details of the design and analysis; 6) pre-registration of studies before data collection; 7) pre-registration of the analysis plan prior to analysis; and 8) replication of published results. WORCS defines the goals of open science in terms of these guidelines, and is designed to facilitate meeting each of these guidelines, with one exception: We do not address replication of published results, because replication relates to the subject of a research project, not to its organization or execution. We group the remaining seven guidelines into three categories: citation (1), sharing (2-5), and preregistration (6-7).
WORCS relies on several free, open source software solutions which we will discuss before introducing the workflow.
The first is dynamic document generation (DDG): Writing scientific reports in a format that interleaves prose with blocks of code used to conduct the analyses. The text is automatically formatted as a scientific paper in several potential styles. When the text is formatted, the code blocks are evaluated and their results are inserted in the text, or rendered as figures and tables. Dynamic document generation supersedes the classical approach of using separate programs to write prose and conduct analyses, and then manually copy-pasting analysis results into the text. The source code of this paper can serve as an example of DDG, and is available here.
Although there is a slight learning curve to transitioning to DDG, we strongly believe that the investment will pay off for users of the workflow, as it did for us. Time saved from painstakingly copy-pasting output and manually formatting text soon outweighs the investment of switching to a new program. Moreover, human error in manually copying results is eliminated. When revisions require major changes to the analyses, all results, figures and tables can be automatically updated. The flexibility in output formats also means that a manuscript can be rendered to presentation format, or even to a website or blog post. Moreover, the fact that code can be run each time the document is compiled encourages researchers to work reproducibly, and allows reviewers and/or readers to verify reproducibility simply by re-compiling the document. In sum, while writing academic papers in a programming environment might seem counter-intuitive at first, this approach is much more amenable to the needs of academics than most word processing software. It prevents mistakes, and saves time.
In the R-implementation of WORCS, we recommend centering a research project
around one dynamically generated 'RMarkdown' document, which includes all
analysis code. Longer scripts can be stored in .R files, and included in the main document using the read_chunk() function from the knitr package, which is installed together with the rmarkdown package. When a reader or reviewer compiles this document, all code is run automatically, thus verifying computational reproducibility. The 'RMarkdown'
document can be automatically formatted in many styles, including
APA style [thanks to the R package papaja, @austPapajaPrepareReproducible2020],
a host of other scientific formats
[see @allaireRticlesArticleFormats2020], and as plain 'Markdown'.
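To illustrate how read_chunk() connects a script to the manuscript, consider the following sketch; the file name analysis.R and the chunk label descriptives are hypothetical:

```r
# In analysis.R, a special comment labels a section of code:

## ---- descriptives
summary(cars)

# In the RMarkdown document, a setup chunk registers the script with
# knitr::read_chunk("analysis.R"); an empty chunk labeled 'descriptives'
# then runs the labeled code when the document is compiled.
```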
The second solution is version control: Maintaining an indelible log of every change to all project files. Version control is a near-essential tool for scientific reproducibility, as anyone learns who has had the experience of accidentally deleting a crucial file, or of being unable to reproduce analyses because they ran some analyses interactively, instead of documenting the syntax in a script file [see also @blischakQuickIntroductionVersion2016]. Many scientists use some form of implicit version control; for example, by renaming files after major changes (e.g., "manuscript_final_2.2-2019-10-12.doc"), tracking changes in word processing software, or using cloud hosting services that retain backups of previous versions.
An integral part of WORCS is the explicit version control software 'Git'
(www.git-scm.com). Git tracks
changes to files, and stores these changes when the user makes a "commit" (a
snapshot of the version controlled files).
Git retains a complete log of all commits, and users can compare changes between
different commits, or go back to a previous version of the code (for example,
after making a mistake, or to replicate a previous version of the results).
A project version controlled with Git is called a "Repository", or Repo.
Git only version controls files explicitly committed by the user. Moreover, it
is possible to prevent files from being version controlled - which is useful for privacy-sensitive data.
A .gitignore
file lists files that should not be version-controlled.
These files thus exist only on the user's private computer.
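For example, a .gitignore file with the following lines (file names hypothetical) keeps raw participant data and R session artifacts out of version control:

```
# Raw participant data must never leave this computer
data/raw_data.csv
# R session artifacts
.Rhistory
.RData
```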
The functionality of 'Git' is amplified by services such as 'GitHub' (https://github.com). 'GitHub' is best understood as a cloud storage service with social networking functionality. The cloud storage aspect of 'GitHub' works as follows: You can "push" (copy) a local 'Git' repository to the 'GitHub' website, as a backup or research archive. The social network aspect comes into play when a repository on 'GitHub' is set to "public": This allows other researchers to peruse the repository and see how the work was done; clone it to their own computer to replicate the original work or apply the methods to their own data; open "Issues" to ask questions or give feedback on the project, or even send a "Pull request" with suggested changes to the text or code for your consideration. 'Git' and 'GitHub' shine as tools for collaboration, because different people can simultaneously work on different parts of a project, and their changes can be compared and automatically merged on the website. Even on solo projects, working with 'Git'/'GitHub' has many benefits: Staying organized, being able to start a new study with a clone of an old, similar repository, or splitting off an "experimental branch" to try something new, while retaining the ability to "revert" (return) to a previous state of the project, or to "merge" (incorporate) the experimental branch.
It is worth noting that,
although this paper focuses on the most widely used 'Git'
remote repository service, 'GitHub', two notable alternatives exist:
'GitLab' (https://gitlab.com) and 'Bitbucket' (https://bitbucket.org).
In
order to present the workflow in a platform-agnostic way, we will refer to such
cloud hosting services as "remote repositories" in the remainder of this
paper.
As of
version 0.1.2, worcs
has been tested with these three remote repository services,
but has been optimized for use with 'GitHub'.
A vignette is included
with instructions for using alternative platforms.
When initializing a new project using the worcs
project template,
an 'https' remote repository address will usually work best with the authentication procedure outlined in the
setup vignette.
The R-implementation of WORCS uses the package gert, a 'Git' client for 'R' [@oomsGertSimpleGit2019], to connect a local project to a remote repository. The worcs package also contains a user-friendly function to set the default user credentials, git_user(), and a function to add, commit, and push all changed files to a remote repository in one step, git_update().
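In daily use, these two functions might be called as follows; the name, email address, and commit message are placeholders, and we assume git_user() accepts the user name and email as arguments:

```r
library(worcs)
# One-time setup: set the credentials recorded with each commit
git_user("Jane Doe", "jane.doe@example.com")
# Routine use: stage, commit, and push all changed files in one step
git_update("Update descriptive statistics")
```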
The third solution is dependency management: Keeping track of exactly what software was used to conduct the analyses. At first glance, it might seem sufficient to state that analyses were conducted in Program X. However, every program is susceptible to changes, updates, and bugfixes. A lot of Open Source software, in particular, is regularly updated because there is an active community of developers contributing functionality and bugfixes. Potentially, any such update could change the results of the code, thus rendering the analysis 'computationally non-reproducible'.
Many solutions exist to ensure computational reproducibility. These solutions typically work by
enveloping your research project in a distinct "environment" that only has access to programs that
are explicitly installed, and maintaining a record of these programs. These solutions differ in
user-friendliness and effectiveness. If strict reproducibility is required, users might choose to conduct all analyses in a 'Docker container': An environment that behaves like a
virtual computer, which can be stored like a sort of time capsule, and
identically reinstated on a user computer, or in the cloud.
This is effective, and also preserves software outside the R environment (e.g.,
preprocessing tools for neuroimaging data). A Docker-based workflow
was recently introduced by @peikertReproducibleDataAnalysis2019. This workflow
is based on many of the same principles as WORCS, and the two approaches are
conceptually and practically compatible. However, the workflow described by
Peikert and Brandmaier can be difficult to set up for novice users.
The accompanying R package repro
is currently in development to facilitate this transition [@repro]. Another
user-friendly Docker-based workflow is offered by the cloud-based collaboration
platform 'Code Ocean'. Although Code Ocean enjoys some popularity among
scientists, it is not clear whether this platform
is GDPR compliant - as of this writing, its terms of use do not address the GDPR, nor is it clear where
data are stored. The necessity to upload data for analysis is a risk with any
cloud-computing solution when human participants are involved.
For analyses conducted entirely in R, the recently released package renv
strikes a great balance between user-friendliness and effectiveness. Developed by
the team behind 'RStudio', this package simply maintains a text-based,
human-readable log of all packages used, their version numbers, and where they
were installed from (e.g., 'CRAN', 'Bioconductor', 'GitHub'). As long as these
repositories maintain their archives, this solution is reliable and very
lightweight.
The text-based log file can be version controlled with 'Git'. renv
installs the logged packages in a cache that is shared between all
projects on the computer. Thus, if one of your projects uses a package that is
already in the cache, it does not
need to be installed again. This prevents long installation times, large space requirements, and the frustration of having to re-install software that is already present on the system. When someone else loads your
project, renv
will install all of the required packages onto a cache on their
computer.
This is not a foolproof method to ensure strict computational reproducibility, but it is more user-friendly and lightweight than available alternatives, requiring no technical skill to set up. When choosing an appropriate solution for dependency management, it is
important to consider that all solutions have a limited shelf life
[@brownHowLearnedStop2017].
Ultimately, the best approach might be to use a "just good-enough" solution, and
acknowledge that all code requires some maintenance if you want to
reproduce it in the future. For projects in R, renv
meets these requirements.
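To illustrate how this works in practice, renv revolves around three core functions:

```r
# Create a project-local library and write a lockfile (renv.lock)
# recording all packages in use, their versions, and their sources
renv::init()
# After installing or updating packages, record the new state
renv::snapshot()
# On another machine, reinstall the exact packages from the lockfile
renv::restore()
```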
A key consideration when developing a research project is what filetypes to use.
A case can be made to use text-based files exclusively (or as much as possible),
instead of binary files.
Text-based files can be read by machines and humans alike.
Binary files, such as 'Word' (.docx) or 'SPSS' (.sav, .spo) files, must be decoded first, and many require commercial software to do so.
The version control software 'Git' is also designed for use with text-based files.
The 'Git' change log reflects changes to text-based files on a line-by-line basis.
Binary files can be version controlled with 'Git', but the change log for a binary file is uninterpretable.
Binary files are also often
larger than text-based files, which means they take up more space on cloud
hosting services. For these reasons, uploading large binary
files to remote repositories is frowned upon.
WORCS encourages the use of
text-based files for all purposes.
Two additional points are worth noting:
First, 'Git' tracks line-by-line changes to text-based files,
based on line breaks in the text.
When writing a version-controlled paper in plain-text,
it is therefore recommended to start each sentence on a new line.
That way, the change log will indicate which specific sentence was edited,
instead of replacing an entire paragraph.
As RMarkdown does not parse a single line break as the end of a paragraph,
this practice does not disrupt the flow of a paragraph in the rendered version.
Second, the remote repository service 'GitHub' rewards the use of text-based files by
rendering certain text-based filetypes for online viewing:
For example, .md
(Markdown) files are displayed as web pages, and .csv
files as spreadsheets.
The worcs
implementation for R makes use of this fact by, for example,
rendering a codebook for the data as .md
so visitors of a 'Git' remote repository
can view it as a web page.
We provide a conceptual outline of the WORCS procedure below, based on WORCS
Version 0.1.2.
For users of the worcs
R package, we highly recommend referring to
the workflow vignette,
which is more detailed and updated with future developments.
```{r}
get_workflow <- function(filename = "../vignettes/workflow.Rmd"){
  txt <- readLines(filename)
  txt <- txt[grepl("<!--S:", txt, fixed = TRUE) | grepl("^###", txt)]
  txt <- gsub("(<!--S: |-->.*$)", "", txt)
  txt <- gsub("^#", "", txt)
  out <- vector("character")
  for(i in seq_along(txt)){
    if(startsWith(txt[i], "#")){
      out <- append(out, c("", txt[i], ""))
    } else {
      out <- append(out, txt[i])
    }
  }
  cat(out, sep = "\n")
}
get_workflow()
```
The R package worcs
automates most steps of the workflow
for R users.
Several arguments support the choice to implement this workflow in R.
First, R and all of its extensions are free, open source software, which makes it a tool of choice for open science.
Second, all tools required for an open science workflow are implemented in R,
most of these tools are directly accessible through the user interface of RStudio,
and some of them are actively developed by the team behind RStudio.
Third, R is the second-most cited statistical software package [@muenchenPopularityDataScience2012], following SPSS, which has no support for any of the open science tools discussed in this paper.
Fourth, R is well-supported;
a vibrant and inclusive online community exists for R,
developing new R methods and packages,
and providing support and tutorials for existing ones.
Finally, R is highly interoperable:
Packages are available to load nearly every imaginable filetype, and output can be written to most file types, including DOCX and PDF (in APA style), and HTML format.
Moreover, wrappers are available for many tools developed in other programming languages, and code written in other programming languages can be evaluated from R, including C++,
Fortran, and Python [@allaireMarkdownPythonEngine2020]. There are excellent free
resources for learning R, such as the book 'R for Data Science' [@grolemundDataScience2017].
Working with R is simplified immensely by using the RStudio integrated development environment (IDE) for R [@rstudioteamRStudioIntegratedDevelopment2015].
RStudio automates and streamlines tedious or complicated aspects of working with
the tools used in WORCS, and many are embedded directly into the visual user
interface.
This makes RStudio a comprehensive solution for open science research projects.
Another important feature of RStudio is project management.
A project bundles writing, analyses, data, references, et cetera, into a self-contained folder that can be uploaded entirely to a 'Git' remote repository and downloaded
by future users. The worcs
R package installs a new RStudio project template.
When a new project is initialized from this template, the bookkeeping required
to set up an open science project is performed automatically.
Before you can use the R-implementation of the WORCS workflow, you have to install the required software.
You have to perform this setup procedure only once for each computer you
intend to use worcs
on. It should take about 30 minutes, and it is documented
in the setup vignette.
The setup vignette does not assume a prior installation of R, so it is suitable for novice users.
After setting up your system, you can use worcs
for all of your
projects. The R-implementation of the workflow introduced earlier is detailed,
step-by-step, in
the workflow vignette.
The TOP-guidelines encourage comprehensive citation of literature, data, materials, methods, and software. In principle, researchers can meet this requirement by simply citing every reference used. Unfortunately, citation of data and software is less commonplace than citation of literature and materials. Crediting these resources is important, because it incentivizes data sharing and the development of open-source software, supports the open science efforts of others, and helps researchers receive credit for all of their research output, in line with the San Francisco Declaration on Research Assessment, DORA.
To facilitate citing datasets, researchers sometimes publish data papers: documents that detail the procedure, sample, and codebook. Specialized journals, such as the Journal of Open Psychology Data, aid in the publication of these data papers. For smaller projects, researchers often simply share the data in a remote repository, along with a text file with the preferred citation for the data (which can be a substantive paper), and the license that applies to the data, such as Creative Commons BY-SA or BY-NC-SA. When in doubt, one can always contact the data creators and ask what the preferred citation is.
References for software are sometimes provided within the software environment;
for example, in R, package references can be found by
calling citation("packagename")
. This returns an APA-style reference, and a
BibTeX entry. Software papers are also common, and are sometimes called
"Application Notes". The Journal of Open Source Software (JOSS)
offers programmers a way to generate a publication
based on well-documented software;
in this process, the code itself is peer reviewed, and writing an actual paper is optional. For an example, see
@rosenbergTidyLPAPackageEasily2018.
Software citations can also sometimes be found
in the source code repository (e.g., on 'Git' remote repositories), using the
'Citation File Format'. The
Software Heritage Project
recently published a BibLaTeX software citation style that will further facilitate
accurate citation of software in dynamically generated documents.
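For example, for an R package installed on the system:

```r
# Retrieve the preferred reference for an installed package
citation("worcs")
# Convert the reference to a BibTeX entry for the project's .bib file
toBibtex(citation("worcs"))
```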
One important impediment to comprehensive citation is the fact that print journals operate with space constraints. Print journals often discourage comprehensive citation, either actively, or passively by including the reference list in the manuscript word count. Researchers can overcome this impediment by preparing two versions of the manuscript: One version with comprehensive citations for online dissemination, and another version for print, with only the essential citations. The print version should reference the online version, so interested readers can find the comprehensive reference list. The WORCS procedure suggests uploading the online version to a preprint server. This is important because most major preprint servers - including arXiv.org and all preprint services hosted by the Open Science Framework (OSF) - are indexed by Google Scholar. This means that authors will receive credit for cited work, even if they are cited only in the online version. Moreover, preprint servers ensure that the online version will have a persistent DOI, and will remain reliably accessible, just like the print version.
The worcs package includes a vignette on citation to explain the fundamentals for novice users, and offers recommendations specific to worcs. We
recommend using the free, open-source reference manager
Zotero; it is feature-rich,
user-friendly, and highly interoperable with other reference managers. A
tutorial for using Zotero with RMarkdown exists here.
The worcs
package also offers original solutions for comprehensive citation.
Firstly,
during the writing process, authors can mark the distinction between essential
and non-essential references. It is easier to do this from the start, instead of
going back to cut non-essential references just prior to publication. Standard
Markdown uses the at-symbol (@) to cite a reference. worcs additionally reserves the "double at"-symbol (@@) to cite a non-essential reference.
Users can render the manuscript either with, or without, comprehensive citations
by adapting the knit command in the front matter, setting it to knit: worcs::cite_all to render all citations, and to knit: worcs::cite_essential to remove all non-essential citations.
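A minimal sketch of a manuscript using both mechanisms; the title and reference keys are hypothetical:

```
---
title: "Example manuscript"
knit: worcs::cite_essential
---

Evidence supports this claim [@keysource2020; @@backgroundsource2019].
```

Rendered this way, the reference marked with @@ is omitted; switching the knit entry to worcs::cite_all reinstates it.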
With regard to the citation of R packages, it is worth noting that the papaja package [@austPapajaPrepareReproducible2020], which provides the 'APA6' template used in worcs, also includes functions to automatically cite all packages used in the R-session, and add their references to the bibliography file.
Data sharing is important for computational reproducibility and secondary analysis. Computational reproducibility means that a third party can exactly recreate the results from the original data, using the published analysis code. Secondary analysis means that a third party can conduct sensitivity analyses, explore alternative explanations, or even use existing data to answer a different research question. From an open science perspective, data sharing is always desirable. From a practical point of view, it is not always possible. Data sharing may be impeded by lack of storage space, legal constraints, ethical restrictions, absence of informed consent for data sharing, or privacy concerns. For example, one concern that applies when working with human participant data is pseudonymization. Many legislatures require researchers to 'de-identify' human subject data upon collection, by removing or deleting any sensitive personal information and contact details. However, when researchers have an imperfect understanding of the legal obligations and dispensations, they can end up over- or undersharing, either placing participants at risk of being identified, or undermining the replicability of their own work and the potential reuse value of their data [@phillipsDiscombobulationDeidentification2016].
Once data are de-identified or pseudonymized, they can often be shared if participants have provided informed consent to that effect. Nevertheless, it is important to note that the European GDPR prohibits storing 'personal data' (information which can identify a natural person, whether directly or indirectly) on a server outside the EU, unless it offers an "adequate level of protection". Although different rules may apply to pseudonymized data, there are many repositories that are GDPR compliant, such as the European servers of the Open Science Framework. Different universities, countries, and funding bodies also have their own repositories that are compliant with local legislation. Before sharing any human participant data, it is recommended to obtain approval from an internal (ethical) review board, guidance from Research Data Management Support, and informed consent from participants.
If data can be shared openly, they can simply be pushed to a public 'Git' repository, along with the analysis code. This way, others can download the entire repository and reproduce the analyses from start to finish. If data cannot be shared, researchers should aim to safeguard the potential for computational reproducibility and secondary analysis as much as possible. WORCS recommends two solutions to accomplish this goal. The first solution is to publish a checksum of the original data file [@rivest1992md5]. Think of a checksum as a 32-character summary, or as a "one word" description, of the contents of a file. Any change to the file will result in a different checksum^[It is theoretically possible but improbable that random changes to a file will result in the same checksum.]. Thus, one can use a checksum to verify the identity of a file, and ensure that its contents are unchanged. When data cannot be shared, the risk of fraud or abuse can be mitigated by publishing a checksum for the original data, in addition to the complete analysis code. Using the checksum to verify the identity of a private dataset, researchers can prove to an independent auditor that running the public analysis code on the private data results in the published results of their work.
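To make this concrete, the following sketch computes an MD5 checksum for a data file; worcs logs such a checksum automatically, so the explicit call is purely illustrative:

```r
# A 32-character MD5 checksum summarizing the exact file contents
tools::md5sum("data.csv")
# Re-running this after any edit to data.csv yields a different value,
# so a match proves the file is identical to the version that was logged
```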
The second solution is to share a synthetic dataset with similar characteristics
to the real data. Synthetic data mimic the level of measurement and
(conditional) distributions of the real data
[see @nowokSynthpopBespokeCreation2016]. Sharing synthetic data allows any third
party to 1) verify that the published code works, 2) debug the code, and 3)
write valid code for alternative analyses. It is important to note that complex
multivariate relationships present in the real data are often lost in synthetic
data. Thus, findings from the real data might not be replicated in the synthetic
data, and findings in the synthetic data should not be substantively interpreted.
Still, sharing synthetic data facilitates data requests from third parties. A
third party can write analysis code based on the synthetic data, and send it to
the authors, who evaluate it on the real data and send back the results. The worcs package offers a simple but flexible function to generate synthetic data, synthetic(), which by default uses random forests to generate a model for synthesis (see @wrightRangerFastImplementation2015). For data synthesis to work properly, it is essential that the class (data type) of variables is defined correctly. As of worcs version 0.1.2, numeric, integer, factor, and logical data are supported out of the box. Other types of variables should be converted to one of these types, or users can use a custom model_expression when calling synthetic().
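For example, assuming a data.frame df with correctly typed columns:

```r
library(worcs)
df <- iris  # example data: four numeric columns and one factor
# Generate a synthetic dataset mimicking the distributions of df,
# using the default random forest model for synthesis
synth <- synthetic(df)
head(synth)
```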
When initializing a new worcs
project, an empty R-script called prepare_data.R
is generated.
As soon as raw data are collected, researchers should use this file to document
all steps necessary to load the
data into R, pseudonymize it, and prepare it for analysis.
As data should be shared in a format as close to raw as possible,
this script should
be as short as possible and only document the minimum necessary steps.
For example, if the data were originally in SPSS format with IP addresses and GPS location, this file might just contain the code required to read the SPSS file, and to remove those columns of sensitive personal information. As soon as the data are pseudonymized and processed into tabular (spreadsheet-like) format, the researcher should version control some indelible record of the raw data.
The concept of 'tidy' data
is a helpful standard; i.e., clean, tabular data that is formatted in a way that facilitates further analysis [see @wickhamTidyData2014].
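A minimal prepare_data.R might look as follows; the file name, the sensitive columns, and the use of the haven package to read 'SPSS' files are illustrative assumptions:

```r
library(haven)
library(worcs)
# Read the raw SPSS file (ignored by 'Git'; it never leaves this computer)
df <- read_sav("raw_data.sav")
# Pseudonymize: remove columns containing sensitive personal information
df$ip_address <- NULL
df$gps_location <- NULL
# Store an indelible record of the data (see below)
closed_data(df)
```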
One issue of concern is the practice of "fixing" mistakes in the data.
It is not uncommon for researchers to perform such corrections manually in the raw
data,
even if all other analysis steps are documented.
Regardless of whether the data will be open or closed, it is important that the
raw data be left unchanged as much as possible. This eliminates human error.
Any alterations to the data - including processing steps and even error
corrections - should be documented in the code, instead of applied to the raw data.
This way, the code will be a pipeline from the raw data to the published results.
Thus, for instance, if data have been entered manually in a spreadsheet, and
a researcher discovers that in one of the variables, missing values were coded as
-99
whereas they were coded as -9
in all other variables, then we would recommend adding
a line of code to recode the missing values - thereby documenting the mistake -
instead of "fixing" it by editing the raw spreadsheet.
The worcs
package offers two functions for version controlling a record of the
data: One for open, and one for closed data. Researchers should assume that the
decision to make data open is irreversible. Thus, the decision should be made well
before arriving at this stage in the workflow. If there is any doubt, it is prudent
to proceed with closed data and make the data open once all doubt is assuaged.
In worcs, researchers can call the function open_data(data) to make the data publicly available. This function stores the data object (such as a data.frame or matrix) in a tabular data file (called data.csv by default).
It also generates an
accompanying codebook -
a human-readable document that serves as a legend for the tabular data file -
that can be viewed online.
All changed files should now be added to the 'Git' repository, committed, and
pushed to the remote repository.
This can be done through the 'Git' panel in RStudio, or by running the worcs function git_update("commit message").
Once the remote repository is made public, these data will be
open to the public.
Assume that, once this is done, the data cannot be un-shared.
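In code, sharing the data openly thus takes two steps (the data object df and the commit message are placeholders):

```r
library(worcs)
# Write df to data.csv, generate a codebook, and earmark both for sharing
open_data(df)
# Commit and push the new files to the remote repository
git_update("Add open data and codebook")
```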
Alternatively, if the project requires data to remain closed, researchers can call closed_data(data). This function also stores the data in a local .csv file and generates a codebook, but the original data file is added to .gitignore so it cannot be accidentally added to the 'Git' repository.
The function also computes a checksum for the original data, and logs it in the
.worcs
project file. This checksum serves as the indelible record of the state
of the raw data.
The function closed_data()
also creates a synthetic dataset (called
synthetic_data.csv
by default).
The user should push all changed files to the remote repository.
Once the remote repository is made public, people will have access to a
checksum for the original data that exist on your local machine, a codebook
describing those data, and a synthetic copy of the data.
Keep in mind that, when using closed data, the original data file exists only on
the user's computer. Adequate provisions to back up the data, while respecting
principles of responsible data stewardship, should be made.
As the purpose of the prepare_data.R
file is to prepare data for sharing,
the final line of this file
should typically be either
open_data() or closed_data().
After generating tidy, shareable data, the data are loaded in the analysis code with the function load_data(). This function loads the real data only if they are available on the user's computer, and otherwise loads the synthetic data.
This will have the effect that third party
users who copy a remote repository with closed data will automatically load the synthetic
dataset, whereas the study authors, who have the real data stored locally, will
automatically load the real data.
This makes it possible for reviewers,
coauthors, and auditors, to write analysis scripts without requiring access to
the original data.
They can simply start the script with load_data(), and write their code based on the synthetic data. Then, they can submit their code to you - by email or as a "Pull request".
When you run their code on your system, load_data()
will load the
original data, and use that to run their script. You can then return the results
to the third party.
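In an analysis script or manuscript chunk, the call is identical for everyone:

```r
library(worcs)
# Loads the real data when present on this machine (the authors' case),
# and the synthetic dataset otherwise (reviewers and other third parties)
load_data()
```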
Note that these functions can be used to store multiple data files, if necessary.
As of version 0.1.2, the worcs
package only stores data as .csv
files.
As explained above, this is recommended, because .csv
files are human- and machine-readable, and because data typically need to be in tabular format
for analysis. Many types of data can be represented as a table; for example,
text corpora can have one row per document, and EEG- or ECG waveforms can have a
row per measurement occasion and a column per channel.
The prepare_data.R
file should document any
steps required to convert these data into tabular format.
If this is not an appropriate format for the data, readers are encouraged to follow the
development of the repro
package [@repro], which will offer greater support for
reproducibly storing and retrieving diverse data formats.
Some users may intend to make their data openly available through a restricted access platform, such as an institutional repository, but not publicly through a 'Git' remote repository.
In this case, it is recommended to use closed_data(), and to manually upload the original data.csv file to the restricted access platform. If users wish to share their data through the Open Science Framework, it is sufficient to connect the OSF page to the 'Git' remote repository as an Add-on.
When writing a manuscript created using dynamic document generation, analysis code is embedded in the prose of the paper. Thus, the TOP-guideline of sharing analysis code can be met simply by committing the source code of the manuscript to the 'Git' repository, and making this remote repository Public. If authors additionally use open data and a reproducible environment (as suggested in WORCS), then a third party can simply replicate all analyses by copying the entire repository from the 'Git' hosting service, and Knitting the manuscript on their local computer.
Aside from analysis code, the TOP-guidelines also encourage sharing new research materials, and details of the study design and analysis. These goals can be accomplished by placing any such documents in the 'Git' repository folder, committing them, and pushing to a cloud hosting service. As with any document version controlled in this way, it is advisable (but not required) to use plain text only.
Lindsay and colleagues [-@lindsayResearchPreregistration1012016] define preregistration as "creating a permanent record of your study plans before you look at the data. The plan is stored in a date-stamped, uneditable file in a secure online archive." Two such archives are well-known in the social sciences: AsPredicted.org, and OSF.io. However, 'Git' cloud hosting services also conform to these standards. Thus, it is possible to preregister a study simply by committing a preregistration document to the local 'Git' repository, and pushing it to the remote repository. Subsequently tagging the release as "Preregistration" on the remote repository renders it distinct from all other commits, and easily findable by people and programs. This approach is simple, quick, and robust. Moreover, it is compatible with formal preregistration through services such as AsPredicted.org or OSF.io: If the preregistration is written using dynamic document generation, as recommended in WORCS, this file can be rendered to PDF and uploaded as an attachment to the formal preregistration service.
The advantages, disadvantages, and pitfalls for preregistering different types of studies have been extensively debated elsewhere [see @lindsayResearchPreregistration1012016]. For example, because the practice of preregistration has historically been closely tied to experimental research, it has been a matter of some debate whether secondary data analyses can be preregistered [but see @westonRecommendationsIncreasingTransparency2019 for an excellent discussion of the topic].
When analyzing existing data, it is difficult to prove that a researcher did not have direct (or indirect, through collaborators or by reading studies using the same data) exposure to the data, before composing the preregistration. However, the question of proof is only relevant from a perspective of preventing scientific misconduct. Preregistration is a good way to ensure that a researcher is not "HARKing": Hypothesizing after the results are known [@kerrHARKingHypothesizingResults1998].
Good faith preregistration efforts always improve the quality of deductive (theory-testing) research, because they avoid HARKing, ensure reliable significance tests, avoid overfitting noise in the data, and limit the number of forking paths researchers wander during data analysis [@gelmanStatisticalCrisisScience2014].
WORCS takes the pragmatic position that, in deductive (hypothesis-testing) research, it is beneficial to plan projects before executing them, to preregister these plans, and adhere to them. All deviations from this procedure should be disclosed. Researchers should minimize exposure to the data, and disclose any prior exposure, whether direct (e.g., by computing summary statistics) or indirect (e.g., by reading papers using the same data). Similarly, one can disclose any deviations from the analysis plan to handle unforeseen contingencies, such as violations of model assumptions; or additional exploratory analyses.
WORCS recommends documenting a verbal, conceptual description of the study plans
in a text-based file. Optionally, an analysis script can be preregistered to
document the planned analyses.
If any changes must be made to the analysis code after obtaining the data,
one can refer to the conceptual description to justify the changes.
The ideal preregistered analysis script consists of a complete analysis
that can be evaluated once the data are obtained.
This ideal is often unattainable, because the data present researchers with
unanticipated challenges;
e.g., assumptions are violated, or analyses work differently than expected.
Some of these challenges can be avoided by simulating the data one expects to obtain,
and writing the analysis syntax based on the simulated data.
This topic is beyond the scope of the present paper, but many user-friendly
methods for simulating data are available in most statistical programming languages.
For instance, R
users can use the package simstudy
[@goldfeldSimstudySimulationStudy2020].
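As a brief sketch of this approach using simstudy (the variable definitions are hypothetical):

```r
library(simstudy)
# Define a normally distributed predictor and an outcome that depends on it
def <- defData(varname = "x", dist = "normal", formula = 0, variance = 1)
def <- defData(def, varname = "y", dist = "normal",
               formula = "0.5 * x", variance = 1)
# Simulate 200 observations to develop and test the analysis script
simulated_data <- genData(200, def)
```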
As soon as a project is preregistered on a Public 'Git' repository, it is visible to the world, and reviewers (both formal reviewers designated by a journal, and informal reviewers recruited by other means) can submit comments, e.g., through "Pull requests" on a 'Git' remote repository. If the remote repository is Private, Reviewers can be invited as "Collaborators". It is important to note that contributing to the repository in any way will void reviewers' anonymity, as their contributions will be linked to a user name. Private remote repositories can be made public at a later date, along with their entire time-stamped history and tagged releases.
The worcs workflow facilitates preregistration,
primarily by explicitly inviting a user to write the preregistration document.
To this end, worcs
imports several preregistration templates from the R package
prereg
[@austPreregMarkdownTemplates2019], including templates from
organizations like AsPredicted.org and OSF.io, and
from researchers [e.g., @vantveerPreregistrationSocialPsychology2016].
When initializing an RStudio project with the worcs
project template, one of
these preregistration templates can be selected, which will generate a file
called preregistration.Rmd
. This file should be used to document study plans,
ideally prior to data collection.
Within the workflow, the 'Git' cloud hosting services can be used to preregister a study simply by pushing a text document with the study plans to a 'Git' remote repository.
By tagging the commit as a release, labelled "preregistration"
(see the workflow vignette),
it is distinct from all other commits, and easily findable by people and machines.
WORCS was designed to meet the TOP-guidelines, and it does so in a way that is compatible with other open science principles. Most notably, WORCS is amenable to the FAIR Guiding Principles for scientific data management and stewardship [@wilkinsonFAIRGuidingPrinciples2016a]. These principles advocate that digital research objects should be Findable, Accessible, Interoperable and Reusable. Initially mainly promoted as principles for data, they are increasingly applied as a standard for other types of research output as well, most notably software [@lamprechtFAIRPrinciplesResearch2019].
WORCS helps users meet the FAIR principles by making all created research objects -
including data and code - accessible in a repository that can be made public.
The worcs
package additionally
facilitates meeting the FAIR principles in several ways. First, each worcs
project has a YAML file (https://yaml.org/) that lists the research objects in
the project, which will allow web crawlers
to index worcs
projects. Second, a codebook is automatically generated for
each data file stored
using open_data() or closed_data(). We encourage users to elaborate on the default codebook by adding variable descriptions and categories, which would - in the future - enable indexing data by topic area.
Incidentally, worcs
itself follows the recommendations for FAIR research software [see https://fair-software.eu/]. It is
hosted on a version controlled public remote repository 'GitHub', has an open
source licence (GPL v3.0), is registered in a community registry (CRAN),
enables the citation of the software (using the citation("worcs")
command, or
by citing this paper), and followed the CII Best Practices
software quality checklist during development.
Sharing all research objects, as advocated in WORCS, also provides a proper basis for research evaluation according to the San Francisco Declaration on Research Assessment (DORA; https://sfdora.org/), which plays an increasing role in grant funding, hiring, and promotion procedures. Direct access to research objects allows stakeholders to evaluate research quality based on content rather than relying on spurious surrogate indicators like journal impact factors, conference rankings, and h-indexes. The detailed version control and commit tracking of 'Git' remote repositories furthermore make it possible to assess the relative contributions made by different researchers.
In this tutorial paper, we have presented a workflow for open reproducible code in science. The workflow aims to lower the threshold for grass-roots adoption of open science principles. The workflow is supported by an R package with an RStudio project template and convenience functions. This relatively light-weight workflow meets most of the requirements for open science as detailed in the TOP-guidelines, and is compatible with other open science guidelines. The workflow helps researchers meet existing requirements for open science, and to reap the benefits of working openly and reproducibly, even when such requirements are absent.
There have been several previous efforts to promote grass-roots adoption of open
science principles. Each of these efforts has a different scope, strengths, and
limitations that set it apart from WORCS. For example, there are "signalling
solutions"; guidelines to structure and incentivize disclosure about open
science practices. Specifically, @aalbersbergMakingScienceTransparent2018
suggested publishing a "TOP-statement" as supplemental material, which discloses
the authors' adherence to open science principles. Relatedly, @aczelConsensusbasedTransparencyChecklist2019 developed a consensus-based
Transparency Checklist that authors can complete online to generate a report.
Such signalling solutions are very easy to adopt, and they address
TOP-guidelines 1-7. Many journals now also offer authors the opportunity to earn
"badges" for adhering to open science guidelines
[@kidwellBadgesAcknowledgeOpen2016]. These signalling solutions help structure
authors' disclosures about, and incentivize adherence to, open science practices.
WORCS, too, has its own checklist that calls attention to a number of concrete
items contributing to an open and reproducible research project. Users of the
worcs
package can receive a badge, displayed on the README.md
file,
based on a semi-automated scoring of the checklist items, by calling the function
check_worcs()
within a worcs
project directory.
A different class of solutions instead focuses on the practical issue of how researchers can meet the requirements of open science. One notable example is the workflow for reproducible analyses developed by @peikertReproducibleDataAnalysis2019. This workflow strives to ensure strict computational reproducibility for even the most sophisticated analyses. It has some limitations, however: In focusing on reproducibility, it only addresses TOP-guidelines 2, 3, and 5. Moreover, in aiming for strict computational reproducibility it ends up being relatively complex to implement. WORCS builds upon the same general principles as Peikert and Brandmaier, which means the two workflows are compatible. What sets WORCS apart is that it is an easy and sufficient solution for projects that can be conducted entirely within R, resulting in a much more lightweight approach to safeguard computational reproducibility under most circumstances. WORCS also addresses a unique issue not covered by other existing solutions, namely to provide a workflow most conducive to satisfying the TOP-guidelines and FAIR principles, while being compatible with existing requirements.
One initiative that WORCS is fully compatible with, is the Scientific Paper of the Future. This organization encourages geoscientists to document data provenance and availability in public repositories, document software used, and document the analysis steps taken to derive the results. All of these goals could be met using WORCS.
WORCS is intended to substantially reduce the threshold for adopting best practices in open and reproducible research. However, several limitations and issues for future development remain. One potential challenge is the learning curve associated with the tools outlined in this paper. Learning to work with R, RMarkdown, and Git requires an initial time investment. The amount of time required is reduced by the availability of tutorials, such as this one, and other tutorials cited throughout this document. Moreover, the time investment tends to pay off. Working with R opens the door to many cutting edge analysis techniques. Working with RMarkdown saves time and prevents mistakes by avoiding tedious copying of results into a text document. Working with Git keeps projects organized, prevents accidental loss of work, enables integrating changes by collaborators in a non-destructive manner, and ensures that entire research projects are archived and can be accessed or copied by third parties. Thus, the time investment is eminently worthwhile.
Another important challenge is managing collaborations when only the lead author
uses RMarkdown, and the coauthors use Word. In this case, it is possible to Knit
the manuscript to Word (.docx), by changing the line output: papaja::apa6_pdf in the manuscript's YAML header to output: papaja::apa6_docx. There are some limitations to the conversion, discussed here.
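That is, the change amounts to one line in the front matter:

```
---
output: papaja::apa6_docx  # previously: papaja::apa6_pdf
---
```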
When soliciting feedback from co-authors, ask them to use Track Changes and
comment bubbles in Word. Then, manually incorporate their
changes into the manuscript.Rmd
file. In most cases, this is the most
user-friendly approach, as most lead authors would review changes by co-authors
anyway. A second approach is to ask collaborators to work in plain text. In this
case, send collaborators the manuscript.Rmd
file, and ask them to open (and
save) it in Word or Notepad as a plain text file. When they send it back, make
sure any changes to your local file are committed, and then simply overwrite
your local version with their file. In RStudio, select the file in the Git tab,
and click the Diff button to examine what changes the collaborator has made
relative to your last committed version. Finally,
@peikertReproducibleDataAnalysis2019 mention the redoc
package [@rossRedocReversibleReproducible2020] as a
potentially useful way to collaborate with Word users. This package enables
bidirectional conversion between Word and
RMarkdown, with support for tracked changes. However, redoc
is currently in
suspended development, and for many users, a simple solution (i.e.,
asking collaborators to work in plain text) might be sufficient.
If all collaborators are committed to using WORCS, they can Fork the repository from the lead author on GitHub, clone it to their local device, make their own changes, and send a pull request to incorporate their changes. Working this way is extremely conducive to scientific collaboration [@ramGitCanFacilitate2013]. Recall that, when using 'Git' for collaborative writing, it is recommended to insert a line break after every sentence so that the change log will indicate which specific sentence was edited, and to prevent "merge conflicts" when two authors edit the same line. The resulting document will be rendered to PDF without spurious line breaks.
Being familiar
with 'Git' remote repositories opens doors to new forms of collaboration:
In the open source
software community, continuous peer review and voluntary collaborative acts by
strangers who are interested in a project are commonplace
[see @adolphOpenBehavioralScience2012]. This kind of collaboration is budding
in scientific software development as well; for example, the lead author of this
paper became a co-author on several R packages after submitting pull requests
with bug fixes or additional functionality [@rosenbergTidyLPAPackageEasily2018; @hallquistMplusAutomationPackageFacilitating2018], and two co-authors of this
paper became involved by contributing pull requests to worcs
. It is also
possible to invite such collaboration by
opening Issues
for tasks that still need to be accomplished, and tag known collaborators to
address them, or invite external collaborators to contribute their expertise.
WORCS provides a user-friendly and
lightweight workflow for open, reproducible research, that meets all
TOP-guidelines, except replication.
Nevertheless, there are clear directions for future developments.
Firstly, although the workflow is currently implemented only in R, it is
conceptually relevant for users of other statistical programming languages, most
notably Python [@10.5555/1593511].
We welcome efforts to implement WORCS in other platforms.
Secondly, even when a project is in R, it may have dependencies outside of the
R environment that cannot be managed using renv.
It is beyond the scope of the worcs
package to support the incorporation of
tools outside of R, or to containerize a project so that it can
be identically reinstated on a different system or virtual machine.
Therefore, another
important area of future development is to enable such extensions of the
workflow. To this end, we are collaborating with @repro on the development of
repro, an R package that will facilitate setting up reproducibility tools such as 'Make' and 'Docker'.
This will make it possible for researchers
to enhance a WORCS project with these strict reproducibility tools,
or vice versa, to initiate a WORCS project within a strictly reproducible
environment generated using repro.
This will combine the strengths of WORCS with those of the workflow proposed
by @peikertReproducibleDataAnalysis2019.
WORCS offers a workflow for open reproducible code in science. The step-by-step procedure outlined in this tutorial helps researchers make an entire research project Open and Reproducible. The accompanying R package provides user-friendly support functions for several steps in the workflow, and an RStudio project template to get the project set up correctly.
WORCS encourages and simplifies the adoption of Open Science principles in daily scientific work. It helps researchers make all research output created throughout the scientific process - not just manuscripts, but data, code, and methods - open and publicly accessible. This enables other researchers to reproduce results, and facilitates cumulative science by allowing others to make direct use of these research objects.
\newpage
\begingroup \setlength{\parindent}{-0.5in} \setlength{\leftskip}{0.5in} \setlength{\parskip}{0em}
\endgroup