Science is one of humanity's greatest inventions. Academia, on the other hand, is not. It is remarkable how successful science has been, given the often chaotic habits of scientists. In contrast to other fields, like say landscaping or software engineering, science as a profession is largely unprofessional - apprentice scientists are taught less about how to work responsibly than about how to earn promotions. This results in ubiquitous and costly errors. Software development has become indispensable to scientific work. I want to playfully ask how it can become even more useful by transferring some aspects of its professionalism, the day-to-day tracking and back-tracking and testing that is especially part of distributed, open-source software development. Science, after all, aspires to be distributed, open-source knowledge development.
"Science as Amateur Software Development" @richardmcelreathScienceAmateurSoftware2020
https://youtu.be/zwRdO9_GGhY
McElreath's words are as enlightening as always. Usually, researchers start their academic careers led by their great interest in a specific scientific area. They want to answer some specific research question, but these questions quickly turn into data, statistical analysis, and lines of code, hundreds of lines of code. Most researchers, however, receive essentially no training about programming and software development good practices resulting in very chaotic habits that can lead to costly errors. Moreover, bad practices may hinder the transparency and reproducibility of the analysis results.
Thanks to the Open Science movement, transparency and reproducibility are recognized as fundamental requirements of modern scientific research. In fact, openly sharing study materials and analyses code are prerequisites for allowing results replicability by new studies. Note the difference between replicability and reproducibility [@nosekWhatReplication2020]:
So, reproducibility simply means re-running someone else's code on the same data to obtain the same result. At first, this may seem a very simple task, but actually, it requires properly organising and managing all the analysis material. Without adequate programming and software development skills, it is very difficult to guarantee the reproducibility of the analysis results.
The present book aims to describe programming good practices and introduce common tools used in software development to guarantee the reproducibility of analysis results. Inspired by Richard McElreath's talk, we want to make scientific research an open-source knowledge development.
The book is structured as follows.
Let's discuss some useful tips about how to get the best out of this book.
This book provides useful recommendations and guidelines that can be applied independently of the specific programming language used. However, examples and specific applications are based on the R programming languages.
In particular, each chapter first provides general recommendations and guidelines that apply to most programming languages. Subsequently, we discuss specific tools and applications available in R.
In this way, readers working with programming languages other than R can still find valuable guidelines and information and can later apply the same workflow and ideas using dedicated tools specific to their preferred programming language.
To guarantee results replicability and project maintainability, we need to follow all the guidelines and apply all the tools covered in this book. However, if we are not already familiar with all these arguments, it could be incredibly overwhelming at first.
Do not try to apply all guidelines and tools all at once. Our recommendation is to build our reproducible workflow gradually, introducing new guidelines and new tools step by step at any new project. In this way, we have the time to learn and familiarize ourselves with a specific part of the workflow before introducing a new step.
The book is structured to facilitate this process, as each chapter is an independent step to build our reproducible workflow:
Learning advanced tools such as Git, pipeline tools, and Docker still requires a lot of time and practice. They may even seem excessively complex at first. However, we should consider them as an investment. As soon as our analyses will become more complex than a few lines e of code, these tools will allow us to safely develop and manage our project.
Most of the arguments discussed in this book are the A-B-C of the daily workflow of many programmers. The problem is that most researchers lack any kind of formal training in programming and software development.
The aim of the book is exactly that: to introduce popular tools and common guidelines of software development into scientific research. We try to provide a very gentle introduction to many programming concepts and tools without assuming any previous knowledge. Note, however, that we assume the reader is already familiar with the R programming language for specific examples and applications.
Inside the book, there are special Info-Boxes that provide further details.
:::{.tip title=" " data-latex="[~]"} Tip-Boxes are used to provide insight into specific topics. :::
:::{.warning title=" " data-latex="[~]"} Warning-Boxes are used to provide important warnings. :::
:::{.deffun title=" " data-latex="[~]"} Instructions-Boxes are used to provide detailed instructions. :::
:::{.design title=" " data-latex="[~]"} Details-Boxes are used to provide further details about advanced topics. :::
:::{.trick title=" " data-latex="[~]"} Trick-Boxes are used to describe special useful tricks. :::
:::{.foil title=" " data-latex="[~]"} Command Cheatsheets are used to summarize commands of a specific software. :::
Moreover, at the end of each chapter, we list all useful links to external documentation in a dedicated box.
:::{.doclinks data-latex=""} Documentation-Boxes are used to collect all useful links to external documentation. :::
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.