Quotes from research that we haven't used yet, but might want to as we edit.
"Many journals provide mechanisms to make reproducibility possible, including PLoS, Nature, and Science. This entails ensuring access to the computer code and datasets used to produce the results of a study. In contrast, replication of scientific findings involves research conducted by independent researchers using their own methods, data, and equipment that validate or confirm the findings of earlier studies. Replication is not always possible, however, so reproducibility is a minimum and necessaray standard for confirming scientific findings." [@keller2017evolution]
"Reproducibility goes well beyond validating statistical results and includes empirical, computational, ethical, and statistical analyses. For example, empirical reproducibility ensures that the same results are obtained from the data and code used in the original study, and statistical reproducibility focuses on statistical design and analysis to ensure replication of an experiment. There are also definitions of ethical reproducibility, such as documenting the methods used in biomedical research or in social and behavioral science research so others can reproduce algorithms used in analysis." [@keller2017evolution]
"Many studies have been undertaken to understand the reproducibility of scientific findings and have come to different conclusions about the findings. For example, one scientist argues that half of all scientific discoveries are false (Ioannidis 2005), others find that a large portion of the reproduced findings produce weaker evidence compared with the original findings (Nosek et al. 2015), and others find that 4/5 of the results are true positives (Jager and Leek 2013). ... Despite this controversy, the premise underlying reproducibility is data quality in the form of good experimental design and execution, documentation, and making scientific inputs available for reproducing the scientific work." [@keller2017evolution]
"In the computer science, engineering, and business worlds, data quality management focuses largely on administrative data and is driven by the need to have accurate, reliable data for daily operations. The kinds of data traditionally discussed in this data quality literature are fundamental to the functioning of an organization---if the data are bad, ifrms will lose money or defective products will be manufactured. The advent of data quality in the engineering and business worlds traces back to the 1940s and 1950s with Edward Deming and Joseph Juran. Japanese companies embraced these methods and transformed their business practices using them. Deming's approach used statistical process control that focused on measuring inputs and processes and thus minimized product inspections after a product was build." [@keller2017evolution]
[In business, ] "Data quality is further defined from the perspective of the ease of use of the data with respect to the integrity, accuracy, interpretability, and value assessed by the data user and other attributes that make the data valuable." [@keller2017evolution]
"Many scientists spend a lot of time using Excel, and without batting an eye will change the value in a cell and save the results. I strongly discourage modifying data this way. Instead, a better approach is to treat all data as read-only and only allow programs to read data and create new, separate files of results. Why is treating data as read-only important in bioinformatics? First, modifying the data in place can easily lead to corrupted results. For example, suppose you wrote a script that directly modifies a file. Midway through processing a large file, your script encounters an error and crashes. Because you've modified the original file, you can't undo the changes and try again (unless you have a backup)! Essentially, this file is corrupted and can no longer be used. Second, it's easy to lose track of how we've changed a file when we modify it in place. Unlike a workflow where each step has an input file and an output file, a file modified in place doesn't give us any indication of what we've done to it. Were we to lose track of how we've changed a file and don't have a backup copy of the original data, our changes are essentially irreproducible. Treating data as read-only may seem counterintuitive may seem counterintuitive to scientists familiar with working extensively in Excel, but it's essential to robust research (and prevents catastrophe, and helps reproducibility). The initial difficulty is well worth it; it also fosters reproducibility. Additionally, any step of the analysis can easily be redone, as the input data is unchanged by the program." [@buffalo2015bioinformatics]
"'Plain text' data files are encoded in a format (typically UTF-8) that can be read by humans and computers alike. The great thing about plain text is their simplicity and their ease of use: any programming language can read a plain text file. The most common plain text format is .csv, comma-separated values, in which columns are separated by commas and rows are separated by line breaks." [@gillespie2016efficient]
Designed data is "data that have traditionally been used in scientific discovery. Designed data include statistically designed data collections, such as surveys or experiments, and intentional observational collections. Examples of intentional observational collections include data obtained from specially designed instruments such as telescopes, DNA sequencers, or sensors on an ocean buoy, and also data from systematically designed case studies such as health registries Researchers have frequently devoted decades of systematic research to understanding and characterizing the properties of designed data collections." This contrasts with administrative data and opportunity data. [@keller2017evolution]
"The need to address data quality is a persistent one in the physical and biological sciences, where scientists often seek to understand subtle effects that leave minute traces in large volumes of data. ... For most scientists, three factors motivate their work on data quality: first, the need to create a strong foundation of data from which to draw their own conclusions; second, the need to protect their data and conclusions from the criticisms of others; and third, the need to understand the potential flaws in data collected by others. The work of these scientists in data quality primarily concentrates on the design and execution of experiments, including in laboratory, field, and clinical settings. The key ingredients are measurement implementation, laboratory and experimental controls, documentation, analysis, and curation of data." [@keller2017evolution]
"The concept of data quality management developed in the 1980s in step with the technological ability to access stored data in a random fashion. Specifically, as data management encoding process moved from the highly controlled and defined linear process of transcribing information to tape, to a system allowing for the random updating and transformation of data fields in a record, the need for a higher level of control over what exactly can go into a data field (including type, size, and values) became evident. Two key data quality concepts came from these data management advances---ensuring data integrity and cleansing the legacy data. Data integrity refers to the rules and processes put in place to maintain and assure the accuracy and consistency of a system that stores, processes, and retrieves data. Data cleaning refers to the identificatio of incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this so-called dirty or coarse data." [@keller2017evolution]
"As the capability to store increasing amounts of data grew, so did the business motivation to improve the quality of administrative data and thereby improve decision making, reduce costs, and gain trust of customers." [@keller2017evolution]
"Determine whether there is a community-based metadata schema or standard (i.e., preferred sets of metadata elements) that can be adopted." [@michener2015ten]
Might make more sense in tidy data section: "Although many standards have been defined for data and model representations, they only ensure that data and models that comply with these standards can be used by software that support these standards; they do not ensure that multiple software tools can be used seamlessly. When software tools are developed by independent research groups or companies without an explicit agreement as to how they can be integrated, this can cause problems when forming a workflow of multiple tools. This is because the tools are likely to be inconsistent in their operating procedures and their use of various non-standardized data formats. Thus, users often have to convert data formats, to learn operating procedures for each tool, and sometimes even to adjust operating environments. This impedes productivity, undermines the flexibility of the workflow, and is prone to errors." [@ghosh2011software]
"Ontologies define the relationships and hierarchies between different terms and allow the unique, semantic annotation of data." [@ghosh2011software]
"There are several international and national bodies, such as OMG, W3C, IEEE, ANSI, and IETF... that formally approve standards or provide a framework for standards development. Although some of the systems biology standards have been certified (for example, SBML is officially adapted by IETF), in general in life sciences this procedure is not particularly important---many of the most successful standards such as GO have not undergone any official approval procedure, but instead have become de facto standards. In fact, many of the most successful standards in other domains are de facto standards." [@brazma2006standards]
"There are four steps involved in developing a complete and self-contained standard: conceptual model design, model formalization, development of a data exchange format, and implementation of the supporting tools." [@brazma2006standards]
"Two competing goals should be balanced when developing the conceptual model of a domain: domain specificity and the need to find the common ground for all related applications. Arguably, the most useful standards are those that consist of the minimum number of most informative parameters. Keeping the list short makes it simple and practicable, while selecting the most informative features ensures accuracy and efficiency. The need for minimalism is reflected by the titles for many such standards---Minimum Information About XYZ."
"For a standard to be successful, laboratory information management systems (LIMS), databases and data analysis, and modeling tools shoudl comply with it. One way of fostering this is to develop easy-to-use 'middleware'---software components that hide the technical complexities of the standard and facilitate manipulation of the standard format in an easy way." [@brazma2006standards]
"Semantics: The meaning of something; in computer science, it is usually used in opposition to syntax (that is, format)." [@brazma2006standards]
"Excel is by far the most common spreadsheet software, but many other spreaadsheet programs exist, notably the Open Office Calc software, which is an open source alternative and stores spreadsheets in an XML-based open standard format called Open Document Format (ODF). This allows the data to be accessed by a wider variety of software. However, even ODF is not ideal is a storage format for aa research data set because spreadsheet formats contain not only the data that is stored in the spreadsheet, but also information about how to display the data, such as fonts and colors, borders and shading." [@murrell2009introduction]
"Spreadsheet software is useful because it displays a data set in a nice rectangular grid of cells. The arrangement of cells into distinct columns shares the same benefits as fixed-width format text files: it makes it very easy for a human to view and navigate within the data. ... because most spreadsheet software provides facilities to import a data set from a wide variety of formats, these benefits of the spreadsheet software can be enjoyed without having to suffer the negatives of using a spreadsheet format for data storage. For example, it is possible to store the data set in a CSV format and use spreadsheet software to view or explore the data." [@murrell2009introduction]
"Thinking about what sorts of questions will be asked of the data is a good way to guide the design of data storage." [@murrell2009introduction]
First, you can still use spreadsheets, but limit their use to recording data, leaving all data cleaning and analysis to be handled with other software. To make it easier to collaborate with statisticians and to interface with programs like R or Python for data cleaning and analysis, it will be easiest if you set up your data recording in a format those programs can read directly. These steps are described in a later section, "...".
Next, you could record data using a statistical language like R. There is an excellent Integrated Development Environment for R called RStudio, which provides a much clearer interface with R compared to running R from a command line, particularly for new users. RStudio allows you to open delimited plain text files, like CSVs, using a grid-style interface. This grid-style interface looks very similar to a spreadsheet, but lacks the ability to include formulas or macros. Therefore, this format enforces a separation between the recording of raw data and the cleaning and analysis of the data.
[R Project templates]
Data cleaning and analysis can then be shifted away from the files used to record the data and into reproducible scripts. These scripts can be clearly documented, either through comments in the code or through open source documentation tools like RMarkdown that interweave code and text in a way that allows the creation of documents that are easier to read than commented code.
This documentation should explain why each step is being done. In cases where it is not immediately evident from the code how the step is being done, this should be documented as well. Any assumptions being used should be clarified in the documentation.
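As a sketch of the kind of documentation meant here (the dataset, thresholds, and dates are hypothetical), comments in a cleaning script can record why each step is done and what assumptions it relies on:

```r
# Hypothetical cleaning script; each comment explains *why* a step is taken.
library(dplyr)
library(readr)

weights <- read_csv("animal_weights.csv")

weights_clean <- weights %>%
  # Why: weights under 10 g are implausible for the adult mice in this study,
  # so we assume they are data-entry errors and exclude them.
  filter(animal_wt_g >= 10) %>%
  # Why: cages "A" and "B" were relabeled on Jan 3, 2019 (see lab notebook);
  # this step applies that correction so all records use the new labels.
  mutate(cage_id = if_else(date_wt_measured == "Jan 3, 2019" & cage_id == "A",
                           "B", cage_id))
```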
"When we have only the letters, digits, special symbols, and punctuation marks that appear on a standard (US) English keyboard, then we can use a single byte to store each character. This is called an ASCII encoding (American Standard Code for Information Exchange). ... UNICODE is an attempt to allow computers to work with all of the characters in all of the languages of the world. Every character has its own number, called a 'code point', often written in the form U+xxxxxx, where every x is a hexidecimal digit. ... There are two main 'encodings' that are used to store a UNICODE code point in memory. UTF-16 always uses two bytes per character of text and UTF-8 uses one or more bytes, depending on which characters are stored. If the text is only ASCII, UTF-8 will only use one byte per character. ... This encoding is another example of additional information that may have to be provided by a human before the computer can read data correctly from a plain text file, although many software packages will cope with different encodings automatically." [@murrell2009introduction]
"A PDF document is primarily a description of how to display information. Any data values within a PDF document will be hopelessly entwined with information about how the data values should be displayed." [@murrell2009introduction]
"Another major weakness of free-form text files is the lack of information within the file itself about the structure of the file. For example, plain text files do not contain information about which special character is being used to separate fileds in a delimited file, or any information about the widths of fields within a fixed-width format. This means that the computer cannot automatically determine where different fields are within each row of a plain text file, or even how many fields there are. A fixed-width format avoids this problem, but enforcing a fixed length for fields can create other difficulties if we do not know the maximum possible length for all variables. Also, if the values for a variable can have very different lengths, a fixed-width format can be inefficient because we store lots of empty space for short values. The simplicity of plain text files make it easy for a computer to read a file as a series of characters, but the computer cannot easily distinguish individual data values from the series of characters. Even worse, the computer has no way of tellins what sort of data is stored in each field. ... In practice, humans must suppy additional information about a plain text file before a computer can successfully determine where the different fields are within a plain text file and what sort of data is stored in each field." [@murrell2009introduction]
"In bioinformatics, the plain-text data we work with is often encoded in ASCII. ASCII is a character encoding scheme that uses 7 bits to represent 128 different values, including letters (upper- and lowercase), numbers, and special nonvisible characters. While ASCII only uses 7 bits, nowadays computers use an 8-bit byte (a unit representing 8 bits) to store ASCII characters. ... Because plain-text data uses characters to encode information, our encoding scheme matters. When working with a plain-text file, 98% of the time you won't have to worry about the details of ASCII and how your file is encoded. However, the 2% of the time when encoding data does matter---usually when an invisible non-ASCII character has entered the data---it can lead to major headaches." [@buffalo2015bioinformatics]
"Programs [in Unix] retreive the data in a file by a system call (a subroutine in the kernel) called
read
. Each timeread
is called, it returns the next part of a file---the next line of text typed on the terminal, for example.read
also says how many bytes of the file were returned, so end of file is assumed when aread
says 'zero bytes are being returned'. If there were any bytes left,read
would have returned some of them." [@kernighan1984unix]"The format of a file is determined by the programs that use it; there is a wide variety of file types, perhaps because there is a wide variety of programs. But since file types are not determined by the file system, the kernel can't tell you the type of a file: it doesn't know it. The
file
command makes an educated guess ... To determine the types,file
doesn't pay attention to the names (although it could have), because naming conventions are just conventions, and thus not perfectly reliable. For example, files suffixed '.c' are almost always C source, but there is nothing to prevent you from creating a '.c' file with arbitrary contents. Instead,file
reads the first hundred bytes of a file and looks for clues to that file type. ... In Unix systems there is just one kind of file, and all that is required to access a file is its name. The lack of file formats is an advantage overall---programmers don't need to worry about file types, and all the standard programs will work on any file." [@kernighan1984unix]"The Unix file system is organized so you can maintain your own personal files without interfering with files belonging to other people, and keep people from interfering with you too." [@kernighan1984unix]
"The clinical patient health record is a longitudinal administrative record of an individual's health information: all the data related to an individual's or a population's health. The health record is a set of nonstandardized data that spans multiple levels of aggregation, from a single measurement element (blood pressure) to collections of diagnoses and related clinical observations. This complexity is compounded by the high degree of human interaction involved in the productio of clinical records, including self-reported data, medical diagnosis, and other patient information." [@keller2017evolution]
"Sharing data through repositories enhances both the quality and the value of the data through standardized processes for curation, analysis, and quality control. By allowing broad access to data, these repositories encourage and support the use of previously collected data to test and extend previous results. Data repositories are quite common in science fields such as astronomy, genomics, and earth sciences. ... These repositories have accelerated discovery by expanding the reach of these data to scientists who are not involved in the initial data collection and experiments. Repositories address challenges that affect data quality through governance, interoperability across systems, and costs." [@keller2017evolution]
One example of a repository is "the sharing of cDNA microarray data through research consortia, which has led to a common set of standards and relatively homogeneous data classes. There are many issues with the sharing of these data, which requires the transformation of biologic to numeric data. These issues may include loss of context, such as laboratory processes followed, and therefore lack of information about the quality of the data when they are transformed. To avoid this loss of information, the consortium ensures that documentation is comprehensive so that other researchers can assess the quality of the data and make comparisons with other studies using the same data. This documentation also includes information on when incorrect assignments of sequence identity are made so that errors are not perpetuated in other studies." [@keller2017evolution]
"Data exchange format: A file or message format that is formally defined so that software can be built that 'knows' where to find various pieces of information." [@brazma2006standards]
"Everything in the Unix system is a file. That is less of an oversimplification than you might think. When the first version of the system was designed, before it even had a name, the discussions focused on the structure of a file system that would be clean and easy to use. The file system is central to the success and convenience of the Unix system. It is one of the best examples of the 'keep it simple' philosophy, showing the power achieved by careful implementation of a few well-chosen ideas." [@kernighan1984unix]
"A file is a sequence of bytes. (A byte is aa small chunk of information, typically 8 bits long. For our purposes, a byte is equivalent to a character.) No structure is imposed on a file by the system, and no meaning is attaached to its contents---the meaning of the bytes depends solely on the programs that interpret the file. Furthermore, ... this is true not just of disc files but of peripheral devices as well. Magnetic tapes, mail messages, characters typed on the keyboard, line printer output, data flowing in pipes---each of these is just a sequence of bytes as far as the systems and the programs in it are concerned." [@kernighan1984unix]
"The Comma-Separated Value (CSV) format is a special case of a plain text format. Although not a formal standard, CSV files are very common and are a quite reliable plain text delimited format that at least solves the problem of where the fields are in each row of the file. The main rules for the CSV format are: (1) Comma-delimited: Each field is separated by a comma (i.e., the character , is the delimiter).; (2) Double-quotes are special: Fields containing commas must be surrounded by double-quotes ... . (3) Double-quote escape sequence: Fields containing double quotes must be surrounded by double-quotes and each embedded double-quote must be represented using two double quotes ... .; (4) Header information: There can be a single header containing the names of the fields." [@murrell2009introduction]
"A data file metaformat is a set of syntactic and lexical conventions that is either formally standardized or sufficiently well established by practice that there are standard service libraries to handle marshaling and unmarshaling it. Unix has evolved or adopted metaformats suitable for a wide range of applications [including delimiter-separated values and XML]. It is good practice to use one of these (rather than an idiosyncratic custom format) whenever possible. The benefits begin with the amount of custom paarsing and generation code that you may be able to avoid writing by using a service library. But the most important benefit is that developers and even many users will instantly recognize these formats and feel comfortable with them, which reduces the friction costs of learning new programs." [@raymond2003art]
"There are three flavors you will encounter: tab-delimited, comma-separated, and variable space-delimited. Of these three formats, tab-delimited is the most commonly used in bioinformatics. File formats such as BED, GTF/GFF, SAM, tabular BLAST output, and VCF are all examples of tab-delimited files. Columns of a tab-delimited file are separated by a single tab character (which has the escape code \t). A common convention (but not a standard) is to include metadata on the first few lines of a tab-delimited file. These metadata lines begin with # to differentiate them from the tabular dataa records. Because tab-delimated files use a tab to delimit columns, tabs in data are not allowed. Comma-separated values (CSV) is another common format. CSV is similar to tab-delimited, except the delimiter is a comma character. While not a common in bioinformatics, it is possible that the data stored in CSV format contain commas (which would interfere with the ability to parse it). Some variants just don't allow this, while others use quotes around entries that could contain commas. Unfortunately, there's no standard CSV format that defines how to handle this and many other issues with CSV---though some guidelines are given in RFC 4180. Lastly, there are space-delimited formats. A few stubborn bioinformatics programs use a variable number of spaces to separate columns. In general, tab-delimited formats and CSV are better choices than space-delimited formats because it's quite common to encounter data containing spaces." [@buffalo2015bioinformatics]
"There are long-standing Unix traditions aabout how textual data formats ought to look. Most of these derive from one or more of the standard Unix metaformats ... just described [e.g., DSV, XML]. It is wise to follow these conventions unless you have strong and specific reasons to do otherwise. ... (1) One record per newline-terminated line, if possible. This makes it easy to extract records with text-stream tools. For data interchange with other operating systems, it's wise to make your file-format parser indifferent to whether the line ending is LF or CR-LF. It's also conventional to ignore trailing whitespace in such formats; this protects against common editor bobbles. (2) Less than 80 characters per line if possible. This makes the format browseable in an ordinary-sized terminal window. If many records must be longer than 80 characters, consider a stanza format... (3) Use # as an introducer for comments. It's good to have a way to embed annotations and comments in data files. It's best if they're actually part of the file structure, and so will be preserved by tools that know its format. For comments that are not preserved during parsing, # is the conventional start character. (4) Support the backslash convention. The least surprising way to support nonprintable control characters is by parsing C-like backslash escapes ... " [@raymond2003art]
"You can take apart these formats and find out which decisions were made to create them ... even old Microsoft Word, which in a long and painful political bottle, finally settled down and 'opened' its format, countless hundres of pages of documentation defining how words apper, how tables of contents are registered, how all of the things that make up a Word document are to be represented. The Microsoft Office File Formats specifications are of a most disturbing, fascinating quality: one can read through them and think: Yes, I see this, I think I understand. But why? ... Even Word is opened now, just regular XML. Strange XML to be sure. All the codes once hidden are revealed." [@ford2015on]
"We wish to draw a distinction between data that is machine-actionable as a result of specific investment in software supporting that data-type, for example, bespoke parsers that understand life science wwPDB files or space science Space Physics Archive Search and Extract (SPASE) files, and data that is machine-actionable exclusively through the utilization of general-purpose, open technologies. To reiterate the earlier point---ultimate machine actionability occurs when a machine can make a useful decision regarding data that it has not encountered before. This distinction is important when considering both (a) the rapidly growing and evolving data environment, with new technologies and new, more complex data-types continuously being developed, and (b) the growth of general-purpose repositories, where the data-types encounted by an agent are unpredictable. Creating bespoke parsers, in all computer languages, is not a sustainable activity." [@wilkinson2016fair]
"[One] way data can come from the Internet is through a web API, which stands for application programming interface. The number of APIs that are being offered by organizations is growing at an ever increasing rate... Web APIs are not meant to be presented in a nice layout, such as websites. Instead, most web APIs return data in a structured format, such as JSON or XML. Having data in a structured format has the advantage that the data can be easily processed by other tools." [@janssens2014data]
"The pileup format [is] a plain-text format that summarizes reads' bases at each chromosome position by stacking or 'piling up' aligned reads." [@buffalo2015bioinformatics]
"Data compression, the process of condensing data so that it takes up less space (on disk drives, in memory, or across network transfers), is an indespensible technology in modern bioinformatics. For example, sequences from a recent Illumina HiSeq run when compressed with Gzip take up 21,408,674,240 bytes, which is a bit under 20 gigabytes. Uncompressed, this file is a whopping 63,203,414,514 bytes (around 58 gigabytes). This FASTQ file has 150 million 200bp reads, which is 10x coverage of the hexaploid wheat genome. The compression ratio (uncompressed size/ compressed size) of this data is approximately 2.95, which translates to a significant space saving of about 66%. Your own bioinformatics projects will likely contain much more data, especially as sequencing costs continue to drop and it's possible to sequence genomes to higher depth, include more biological replicates or time points in expression studies, or sequence more individuals in genotyping studies. For the most part, data can remain compressed on the disk throughout processing and analysis. Most well-writted bioinformatics tools can work natively with compressed data as input, without requiring us to decompress it to disk first. Using pipes and redirection, we can stream compressed data and write compressed files directly to the disk. Additionally, common Unix tools like cat, grep, and less all have variants that work with compressed data, and Python's gzip module allows us to read and write compressed data from within Python. So while working with large datasets in bioinformatics can be challenging, using the compression tools in Unix and software libraries make our lives much easier." [@buffalo2015bioinformatics]
"Non-text files definitely have their place. For example, very laarge databases usually need extra address information for rapid access; this has to be binary for efficiency. But every file format that is not text must have its own family of support programs to do things that the standard tools could perform if the format were text. Text files may be a little less efficient in maachine cycles, but this must be balanced against the cost of extra software to maintain more specialized formats. If you design a file format, you should think carefully before choosing a non-textual representation." [@kernighan1984unix]
"Out-of-memory approaches [are] computational strategies built arouond storing and working with data kept out of memory on the disk. Reading data from a disk is much, much slower than working with data in memory... but in many cases this is the approach we have to take when in-memory (e.g., loading the entire dataset into R) or streaming approaches (e.g., using Unix pipes ...) aren't appropriate." [@buffalo2015bioinformatics]
"In general, it is possible to jump directly to a specific location within a binary format file, whereas it is necessary to read a text-based format from the beginning and one character at a time. This feature of accessing binary formats is called random access and it is generlaly faster than the typically sequential access of text files." [@murrell2009introduction]
"We often need fast read-only access to data linked to a genomic location or range. For the scale of data we encounter in genomics, retrieving this type of data is not trivial for a few reasons. First the data might not fit entirely in memory, requiring an approach where data is kept out of memory (in other words, on a slow disk). Second, even powerful relational database systems can be sluggish when querying out millions of entries that overlap a specific region---an incrediably common operation in genomics. [BGZF and Tabix] are specifically designed to get around these limitations, allowing fast random-access of tab-delimited genome position data." [@buffalo2015bioinformatics]
"Samtools now supports (after version 1) a new, highly compressed file format known as CRAM. Compressing alignments with CRAM can lead to a 10%--30% filesize reduction compared to BAM (and quite remarkably, with no significant increase in compression or decompression time compared to BAM). CRAM is a reference-based compression scheme, meaning only the aligned sequence that's different from the reference sequence is recorded. This greatly reduces file size, as many sequence may align with minimal difference from the reference. As a consequence of this reference-based approach, it is imperative that the reference is available and does not change, as this would lead to a loss of data kept in the CRAM format. Because the reference is so important, CRAM files contain an MD5 checksum of the reference file to ensure it has not changed. CRAM also has support for multiple different lossy compression methods. Lossy compression entails some information about an alignment and the original read is lost. For example, it's possible to bin base quality scores using a lower resolution binning scheme to reduce the filesize." [@buffalo2015bioinformatics]
"Very often we need efficient random access to subsequences of a FASTA file (given regions). At first glance, writing a script to do this doesn't seem difficult. We could, for example, write a script that iterates through FASTA entries, extracting sequences that overlaps the range specified. However, this is not an efficient method when extracting a few random subsequences. To see why, consider accessing the sequence from position chromosome 8 (123,407,082 to 123,419,742) from the mouse genome. This approach would needlessly parse and load chromosomes 1 through 7 into memory, even though we don't need to extract subsequences from these chromosomes. Reading entire chromosomes from disk and copying them into memory can be quite inefficient---we would have to load all 125 megabytes of chromosome 8 to extract 3.6kb! Extracting numerous random subsequences from a FASTA file can be quite computationaally costly. A common computational strategy that allows for easy and fast random access is indexing the file. Indexed files are ubiquitous in bioinformatics." [@buffalo2015bioinformatics]
"We can avoid needlessly reading the entire file off of the disk by using an index that points to where certian blocks are in the file. In the case of our FASTA file, the index essentially stores the location of where each sequence begins in the file (as well as other necessary information). When we look up a range like chromosome 8 (123,407,082--123,410,744), samtools faidx uses the information in the index to quickly calculate exactly where in the file those bases are. Then, using an operation called a file seek, the program jumps to this exact position (called the offset) in the file and starts reading the sequence. Having precomputed file offsets combined with the ability to jump to those exact positions is what makes accessing sections of an indexed file fast." [@buffalo2015bioinformatics]
"The data revolution within the biological and physical science world is generating massive amounts of data from ... a wide range of ... projects, such as those undertaken at the Large Hadron Collider and genomics-proteomics-metabolomics research." [@keller2017evolution]
"Community standards for data description and exchange are crucial. These facilitate data reuse by making it easier to import, export, compare, combine, and understand data. Standards also eliminate the need for the data creator to develop unique descriptive practices. They open the door to development of disciplinary repositories for specific classes of data and specialized software management tools. GenBank, the US NIH genetic sequence database, and the US National Virtual Observatory are good examples of what is possible here. In 2007, the US National Science Foundation, recognizing the importance of such standards, established the Community Based Data Interoperability Networks (INTEROP) funding programme for the development of tools, standards, and data management best practices within specific disciplinary communities. ... Although many classes of scientific data aren't ready, or aren't appropriate, for standardization, well chosen investments in standardization show a consistently high pay-off." [@lynch2008big]
"For certain types of important digital objects, there are well-curated, deeply-integrated, special-purpose repositories such as Genbank, Worldwide Protein Data Bank, and UniProt... However, not all datasets or even data types can be captured by, or submitted to, these repositories. Many important datasets emerging from traditional, low-throughput bench science don't fit in the data models of these special-purpose repositories, yet these datasets are no less important with respect to integrative research, reproducibility, and reuse in general. Apparently in response to this, we see the emergence of numerous general-purpose data repositories [e.g., FigShare, Mendeley]. ... Such repositories accept a wide range of data types in a wide range of formats, generally do not attempt to integrate or harmonize the distributed data, and place few restrictions (or requirements) on the descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving away from centralization, is becoming more diverse, and less integrated, thereby exacerbating the discovery and re-usability problem for both human and computational stakeholders." [@wilkinson2016fair]
"It would be unwise to bet that these formats [SAM/BAM files] won't change (or even be replaced at some point)---the field of bioinformatics is notorious for inventing new data formats (the same goes with computing in general) ... So learning how to work with specific bioinformatics formats may seem like a lost cause, skills such as following a format specification, manipulating binary files, extracting information from bitflags, and working with application programming interfaces (API) are essential skills when working with any format." [@buffalo2015bioinformatics]
From a working group on bioinformatics and data-intensive science: "Many simple analyses are not automated because data formats are a moving target. ... The community has been slow to share tools, partially because tools are not robust against different input formats." [@barga2011bioinformatics]
"Different centres generate data in different formats, and some analysis tools require data to be in particular formats or require different types of data to be linked together. Thus, time is wasted reformatting and reintegrating data multiple times during a single analysis. For example, next-generation sequencing companies do not deliver raw sequencing data in a format common to all platforms, as there is no industry-wide standard beyond simple text files that include the nucleotide sequence and the corresponding quality values. As a result, carrying out sequencing analyses across different platforms requires tools to be adapted to specific platforms. It is therefore crucial to develop interoperable sets of analysis tools that can be run on different computational platforms depending on which is best suited for a given application, and then stitch those tools together to form analysis pipelines." [@schadt2010computational]
"Many important datasets emerging from traditional, low-throughput bench science don't fit in the data models of ... special-purpose repositories [like Genbank, Worldwide Protein Data Bank, and UniProt], yet these datasets are no less important with respect to integrative research, reproducibility, and reuse in general. Apparently in response to this, we see the emergence of numerous general-purpose data repositories [e.g., FigShare, Mendeley]. ... Such repositories accept a wide range of data types in a wide range of formats, generally do not attempt to integrate or harmonize the distributed data, and place few restrictions (or requirements) on the descriptors of the data deposition. The resulting data ecosystem, therefore, appears to be moving away from centralization, is becoming more diverse, and less integrated, thereby exacerbating the discovery and re-usability problem for both human and computational stakeholders." [@wilkinson2016fair]
"Simplicity, but not oversimplification, is the key to success [in developing standards]." [@brazma2006standards]
"Minimum reporting guidelines, terminologies, and formats (hereafter referred to as reporting standards) are increasingly used in the structuring and curation of datasets, enabling data sharing to varying degrees. However, the mountain of frameworks needed to support data sharing between communities inhibits the development of tools for data management, reuse and integration. ... The same framework [on the other hand] enables researchers, bioinformaticians, and data managers to operate within an open data commons." [@sansone2012toward]
"'One of the core issues of Bioinformatics is dealing with a profusion of (often poorly defined or ambiguous) file formats. Some ad hoc simple human readable formats have over time attained the status of de facto standards.'-- Peter Cock et al. (2010)" [@buffalo2015bioinformatics]
"Developing and using a standard is often an investment that will not pay off immediately, therefore there is a much better chance of success if the user community decides that the respective standard is needed." [@brazma2006standards]
"Although standardization is not a goal in itself, its importance is growing in a high-throughput era. This is similar to what happened to manufacturing during industrialization. The data from high-throughput technologies are being generated at a rate that makes managing and using these data sets impossible on a case-by-case basis. Although some of the data generated by the newest technologies might have a low signal-to-noise ratio to make data re-usable, the data quality is improving as the technology matures, and it is a waste of resources not to share and re-use these expensive datasets. However, this is only possible if the instrumentation that generates these data, laboratory-based storage information management systems and databases, data analysis tools, and systems modeling software can talk to each other easily. This is the purpose of standardization." [@brazma2006standards]
"A standard is successful only if it is used, and it is important to ensure that supporting software tools are designed and implemented." [@brazma2006standards]
"In the late 2000s, there arose the 'NoSQL movement', coalescing around a collective desire of many programmers to move beyond the strictures of the relational model and unshackle themeselves from SQL. Our data varied and diverse, they said, even if programmers weren't that varied and diverse, and we are tired of pretending that one technology will address the need for speed. Dozens of new databases appeared, each with different merits. There were key-value databases, like Kyoto Cabinet, which optimized for speed of retrieval. There were search-engine libraries, like Apache Lucene, which made it relatively eaasy to search through enormous corpora of text---your own Google. There was Mongo DB, which allowed for 'documents', big arbitrary blobs of dataa, to be stored without nice rows and consistent structure. People debated, and continue to debate, the value of each. ... There is as yet no absolute challenger to the relationship model. When people think database, they still think SQL." [@ford2015i]
"By information or data communication standard we mean a convention on how to encode data or information about a particular domain (such as gene function) that enables unambiguous transfer and interpretation of this information or data." [@brazma2006standards]
"The proper acquisition and handling of data is crucially important for both the generation and verification of hypotheses. The rapid development of high-throughput experimental techniques is transforming life-science research into 'big data' science, and although numerous data-management systems exist, the heterogeneity of formats, identifiers, and data schema pose serious challenges. In this context, data-management systems need standardized formats for data exchange, globally unique identifiers for data mapping, and common interfaces that allow the integration of disparate software tools in a computational workflow." [@ghosh2011software]
"Data quality [for health registries data] is driven by multiple dimensions such as clinical data standardization, the existence of common definitions of data fields, and the validity of self-reported patient conditions and outcomes. Recognized issues include the definitions of data fields and their relational structure, the training of personnel related to data collection data processing issues (data cleaning), and curation." [@keller2017evolution]
If you have data in a structured, tabular format that doesn't follow these rules, you don't need to consider it "dirty", though---just think of "tidy" as the name for this particular structure of data (the name, in this case, connects the data format with a set of tools in R called the "tidyverse").
"Software systems are transparent when they don't have murky corners or hidden depths. Transparency is a passive quality. A program is passive when it is possible to form a simple mental model of its behavior that is actuaally predictive for all or most cases, because you can see through the machinery to what is actually going on." [@raymond2003art]
"Software systems are discoverable when they include features that are designed to help you build in your mind a correct mental model of what they do and how they work. So, for example, good documentation helps discoverability to a programmer. Discoverability is an active quality. To achieve it in your software, you cannot merely fail to be obscure, you have to go out of your way to be helpful." [@raymond2003art]
"Elegant code does much with little. Elegant code is not only correct but visibly, transparently correct. It does not merely communicate an algorithm to a computer, but also conveys insight and assurance to the mind of a human that reads it. By seeking elegance in our code, we build better code. Learning to write transparent code is a first, long step toward learning how to write elegant code---and taking care to make code discoverable helps us learn how to make it transparent. Elegant code is both transparent and discoverable." [@raymond2003art]
"To design for transparency and discoverability, you need to apply every tactic for keeping your code simple, and also concentrate on the ways in which your code is a communication to other human beings. The first questions to ask, after 'Will this design work?' are 'Will it be reaadable to other people? Is it elegant?' We hope it is clear ... that these questions are not fluff and that elegance is not a luxury. These qualities in the human reaction to software are essential for reducing its bugginess and increasing its long-term maintainability." [@raymond2003art]
"The Unix style of design applies the do-one-thing-well approach at the level of cooperating programs as well as cooperating routines within a program, emphasizing small programs connected by well-defined interprocess communication or by shared files. Accordingly, the Unix operating system encourages us to break our programs down into simple subprocesses, and to concentrate on the interfaces between these subprocesses." [@raymond2003art]
"The ability to combine programs [with piping] can be extremely useful. But the real win here is not cute combinations; it's that because both pipes and more(1) exist, other programs can be simpler. Pipes mean that programs like ls(1) (and other programs that write to standard out) don't have to grow their own pagers---and we're saved from a word of a thousand built-in pagers (each, naturally, with its own divergent look and feel). Code bloat is avoided and global complexity reduced. As a bonus, if anyone needs to customize pager behavior, it can be done in one place, by changing one program. Indeed, multiple pagers can exist, and will all be useful with every application that writes to standard output." [@raymond2003art]
"Unix was born in 1969 and has been in continuous production use ever since. That's several geological eras by computer industry standards. ... Unix's durability and adaptability have been nothing short of astonishing. Other technologies have come and gone like mayflies. Machines have increased a thousand-fold in power, languages have mutated, industry practice has gone through multiple revolutions---and Unix hangs in there, still producing, still paying the bills, and still commanding loyalty from many of the best and brightest software technologists on the planet." [@raymond2003art]
"One of the many consequences of the exponential power-versus-time curve in computing, and the corresponding pace of software development, is that 50% of what one knows becomes obsolete over every 18 months. Unix does not abolish this phenomenon, but does do a good job of containing it. There's a bedrock of unchanging basics---languages, system calls, and tool invocations---that one can actually keep for entire years, even decades. Elsewhere it is impossible to predict what will be stable; even entire operating systems cycle out of use. Under Unix, there is a fairly sharp distinction between transient knowledge and lasting knowledge, and one can know ahead of time (with about 90% certainty) which category something is likely to fall in when one learns it. Thus the loyalty Unix commands." [@raymond2003art]
"Unix is famous for being designed around the philosophy of small, sharp tools, each intended to do one thing well. This philosophy is enabled by using a common underlying format---the line-oriented, plain text file. Databases used for system administration (users and passwords, network configuration, and so on) are all kept as plain text files. ... When a system crashes, you may be faced with only a minimal environment to restore it (you may not be able to access graphics drivers, for instance). Situations such as this can really make you appreciate the simplicity of plain text." [@hunt2000pragmatic]
"Unix is the foundational computing environment in bioinformatics because its design is the antithesis of [a] inflexible and fragile approach. The Unix shell was designed to allow users to easily build complex programs by interfacing smaller modular programs together. This approach is the Unix philosophy: 'This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.'--Doug McIlory". [@buffalo2015bioinformatics]
"Passing the output of one program directly into the input of another program with pipes is a computationally efficient and simple way to interface Unix programs. This is another reason why bioinformaticians (and software engineers in general) like Unix. Pipes allow us to build larger, more complex tools from modular parts. It doesn't matter what language a program is written in, either; pipes will work between anything as long as both programs understand the data passed between them. As the lowest common denominator between most programs, plain-text streams are often used---a point that McIlroy makes in his quote about the Unix philosophy." [@buffalo2015bioinformatics]
If the data is the same regardless of whether it's "tidy" or not, then why all the fuss about following the "tidy" principles when you're designing the format you'll use to record your data? The magic is this---if you follow these principles, then your data can be immediately input into a collection of powerful tools for visualizing and analyzing the data, without further cleaning steps. What's more, all those tools (the set of tools is called the "tidyverse") will typically output your data in a "tidy" format, as well.
These small tools can be combined together because they take the same kind of input (data in a "tidy" format) and they output data in the same format (also data in a "tidy" format). This is such a powerful idea that many of the best-loved toys work on the same principle. Think of interlocking plastic block sets, like Lego. You can create almost anything with a large enough set of Legos, because they can be combined in almost any kind of way. Why? Because they all follow a standard size for the ... on top of each block, and they all "input" ... of that same size on the bottom of the block. That means they can be joined together in any order and combination, and as a result very complex structures can be created. It also means that each piece can be small and easy to understand---if you're building a Lego structure, even something very fancy, you'll probably use lots of rectangular bricks that are two ... across and four ... long, and that's easy enough to describe that you could probably get a young child to help you find those pieces when you need them.
The "tidy" data format is an implementation of a structured data format popular among statisticians and data scientists. By consistently using this data format, researchers can combine simple, generalizable tools to perform complex tasks in data processing, analysis, and visualization.
"Base R graphics came historically first: simple, procedural, conceptually motivated by drawing on a canvas. There are specialized functions for different types of plots. These are easy to call---but when you want to combine them to build up more complex plots or exchange one for another, this quickly gets messy, or even impossible. The user plots ... directly onto a (conceptual) canvas. She explictly needs to deal with decisions such as how much space to allocate to margins, axis labels, titles, legends, subpanels; once something is 'plotted', it cannot be moved or erased. There is a more high-level approach: in the grammar of graphics, graphics are built up from modular logical pieces, so that we can easily try different visualization types for our data in an intuitive and easily deciphered way, just as we can switch in and out parts of a sentence in human language. There is no concept of a canvas or a plotter; rather, the user gives
ggplot2
a high-level description of the plot she wants, in the form of an R object, and the rendering engine takes a holistic view of the scene to lay out the graphics and render them on the output device." [@holmes2018modern]
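A short sketch of the grammar-of-graphics style described above, reusing the hypothetical drug_conc data frame from the earlier sketch: the plot is assembled from modular pieces into an R object and only rendered when printed.

```r
# Build a plot as an R object from modular pieces (grammar of graphics), then
# render it by printing. Assumes the hypothetical drug_conc data frame above.
library(ggplot2)

p <- ggplot(drug_conc, aes(x = time_hr, y = conc_ng_ml, color = animal_id)) +
  geom_line() +
  geom_point() +
  labs(x = "Time (hours)", y = "Drug concentration (ng/mL)")

p   # rendering happens when the plot object is printed
```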
Older text
It is usually very little work to record data in a structure that follows the "tidy data" principles, especially if you are planning to record the data in a two-dimensional, tabular format already, and following these principles can bring some big advantages. We explain these rules and provide examples of biomedical datasets that do and do not comply with these principles, to help make it clearer how you could design a "tidy"-compliant structure for recording experimental data in your own research.
If the data is the same regardless of whether it's "tidy" or not, then why all the fuss about following the "tidy" principles when you're designing the format you'll use to record your data? The magic is this---if you follow these principles, then your data can be input immediately into a collection of powerful tools for visualizing and analyzing the data, without further cleaning steps (as discussed in the previous module). What's more, all of those tools (the set is called the "tidyverse") will typically output your data in a "tidy" format, as well.
Once you have tools that input and output data in the same way, it becomes very easy to design each tool as a "small, sharp tool"---each one does one thing, and does it really well. That's because, if each tool needs the same type of input and creates that same type of output, those tools can be chained together to solve complex problems. The alternative is to create large software tools, ones that do a lot to the input data before giving you some output. "Big" tools are harder to understand, and, more importantly, they make it hard to adapt your own solutions and to go beyond the analysis or visualization that the original tool creators had in mind. Think of it this way---if you were writing an essay, how much more could you say if you could mix and match words to create your own sentences, rather than being made to combine pre-set sentences?
It is likely that there are certain types of experiments that you conduct regularly, and that they're often trying to answer the same type of question and generate data of a consistent type and structure. This is a perfect chance to lay down rules or a pattern for how members of your research group will record that data.
These rules can include:
[Figure: Three tables---measurements on a drug (chemistry), measurements on an animal (weight), measurements on an animal at time points (drug concentration)]
You can then take this information and design a template for collecting that type of data. A template is, in this case, a file that gives the "skeleton" of the table or tables. You will create this template file and save it somewhere easy for lab members to access, with a filename that makes it clear that this is a template. For example, you may create a folder with all of the table templates for your experiment type, and within it name a template for collecting animal weights at the start of the experiment something like "animal_wt_table_template.csv" or "animal_wt_table_template.xlsx". Each time someone starts an experiment collecting that type of data, he or she can copy that template file, move it to the directory with files for that experiment, and rename it. Observations can then be recorded directly into that copy of the file.
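This copy-and-rename step can even be done from R; here is a minimal sketch, in which all folder and file names are hypothetical:

```r
# Copy the lab's shared template into this experiment's folder under a new name.
# The paths here are made-up examples---substitute your own.
file.copy(from = "templates/animal_wt_table_template.csv",
          to   = "experiments/exp_2019_01/animal_wt_exp_2019_01.csv")
```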
[Figure: Example template file]
    ####################################################################################
    ####################################################################################
    #
    # Column names and meanings
    #
    # animal_id: A unique identifier for each animal.
    # animal_wt_g: The weight of the animal, recorded in grams.
    # date_wt_measured: The date that the animal's weight was measured, recorded as
    #   "month day, year", e.g., "Jan 1, 2019"
    # cage_id: A unique identifier for the cage in which the animal was housed
    #
    # Other table templates for this experiment type:
    # drug_conc_by_time.csv: A template for recording drug concentrations in the animals
    #   by time point
    #
    animal_id, animal_wt_g, date_wt_measured, cage_id
    "A101", 50.2, "Jan 1, 2019", "B"
Adding one row of sample values, to be deleted each time the template is copied and used, can be very helpful. It reminds the user of the format expected for each column (for example, the format the date should be recorded in), as well as small details like which columns should include quotation marks.
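A nice side effect of starting the documentation lines with "#" is that they don't get in the way when the file is read into R. As a rough sketch (assuming the readr and lubridate packages, with a hypothetical file name):

```r
library(readr)
library(lubridate)

# Lines beginning with "#" are skipped, so the documentation header is ignored.
animal_wt <- read_csv("animal_wt_exp_2019_01.csv", comment = "#")

# Convert the date column from "Jan 1, 2019"-style text into a proper Date.
animal_wt$date_wt_measured <- mdy(animal_wt$date_wt_measured)
```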
These template tables can be created as flat files, like comma-separated value (CSV) files. If that is too big of a jump, they can also be created as spreadsheet files---many of the downsides of spreadsheets are linked to embedded macros, mixing of raw and processed or calculated data, and similar factors, rather than to their use as a method for recording data. Do note, though, that plain text files like CSVs can be opened in RStudio in a spreadsheet-like view. Data can be recorded directly there, in a format that will feel comfortable for spreadsheet users, but without all of the bells and whistles that we're aiming to avoid in spreadsheet programs like Excel.
[Figure---Opening a csv file with a spreadsheet like view]
There are some advantages to shifting to recording data in flat files like CSVs rather than Excel files, and to using the spreadsheet-style view in RStudio if you find it easier than working with the files in a text editor (which can get tough, since the values in a column don't always line up visually, and you have to remember to put the right number of values in each row). By recording the data in a plain text file, you can later move to tracking changes made to the data using the version control tool git. This is a powerful tool that can show who made changes to a file and when, with exact details on the changes made and room for comments on why each change was made. However, git does not provide useful views of changes made to binary files (like Excel files), only of those made to plain text files. Further, plain text files are guaranteed not to try to "outsmart" you---for example, they will not try to convert something that looks like a date into a date. Instead, they will leave things exactly as you typed them. Finally, later in this book we will build up to creating templates that do even more---for example, templates for reports you need to write and presentations you need to give, as well as templates for the whole structure of a project. Plain text files fit very nicely into this developing framework, while files in complex binary formats like xlsx don't fit as naturally.
Google Sheets is another tool that might come in useful. [More about using this with R.]
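As a rough sketch of one option, the googlesheets4 package can read a sheet straight into R; the URL below is only a placeholder:

```r
# A minimal sketch: read a Google Sheet directly into R with googlesheets4.
# The sheet URL is a placeholder---substitute the link to your own sheet.
library(googlesheets4)

gs4_deauth()  # enough for sheets shared as "anyone with the link can view"
weights <- read_sheet("https://docs.google.com/spreadsheets/d/<your-sheet-id>")
```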
This idea of creating template files for data recording isn't revolutionary---many laboratory groups have developed spreadsheet template files that they share and copy to use across similar experiments that they conduct. The difference here is in creating a table for recording data that follows the tidy data principles, or at least comes close to them (any steps away from characteristics like embedded macros and use of color to record information will be helpful).
The next chapter will walk through two examples of changing from non-tidy table templates to ones that record data in a way that follows the tidy data principles.
"Or maybe your goal is that your data is usable in a wide range of applications? If so, consider adopting standard formats and metadata standards early on. At the very least, keep track of versions of data and code, with associated dates." [@goodman2014ten]
"Standards for data include, for example, data formats, data exchange protocols, and meta-data controlled vocabularies." [@barga2011bioinformatics]
"Software systems are transparent when they don't have murky corners or hidden depths. Transparency is a passive quality. A program is passive when it is possible to form a simple mental model of its behavior that is actuaally predictive for all or most cases, because you can see through the machinery to what is actually going on." [@raymond2003art]
"Software systems are discoverable when they include features that are designed to help you build in your mind a correct mental model of what they do and how they work. So, for example, good documentation helps discoverability to a programmer. Discoverability is an active quality. To achieve it in your software, you cannot merely fail to be obscure, you have to go out of your way to be helpful." [@raymond2003art]
"Elegant code does much with little. Elegant code is not only correct but visibly, transparently correct. It does not merely communicate an algorithm to a computer, but also conveys insight and assurance to the mind of a human that reads it. By seeking elegance in our code, we build better code. Learning to write transparent code is a first, long step toward learning how to write elegant code---and taking care to make code discoverable helps us learn how to make it transparent. Elegant code is both transparent and discoverable." [@raymond2003art]
"To design for transparency and discoverability, you need to apply every tactic for keeping your code simple, and also concentrate on the ways in which your code is a communication to other human beings. The first questions to ask, after 'Will this design work?' are 'Will it be reaadable to other people? Is it elegant?' We hope it is clear ... that these questions are not fluff and that elegance is not a luxury. These qualities in the human reaction to software are essential for reducing its bugginess and increasing its long-term maintainability." [@raymond2003art]
"Software is maintainable to the extent that people who are not its author can successfully understand and modify it. Maintainability demands more than code that works; it demands code that follows the Rule of Clarity and communicates successfully to human beings as well as the computer." [@raymond2003art]
"An equivalent to the laboratory notebook that is standard good practice in labwork, we advocate the use of a computational diary written in the R markdown format. ... Together with a version control system, R markdown helps with tracking changes." [@holmes2018modern]
"R.A. Fisher, one of the fathers of experimental design, is quoted as saying 'To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the expierment died of.' So it is important to design an experiment with the analysis already in mind. Do not delay thinking about how to analyze the data until after they have been acquired. ... Dailies: start with the analysis as soon as you have acquired some data. Don't wait until everything is collected, as then it's too late to troubleshoot. ... Start writing the paper while you're analyzing the data. Only once you're writing and trying to present your results and conclusions will you realize what you should have done properly to support them." [@holmes2018modern]
"In the same way a file director will view daily takes to correct potential lighting or shooting issues before they affect too much footage, it is a good idea not to wait until all runs of an experiment have been finished before looking at the data. Intermediate data analyses and visualizations will track unexpected sources of variation and enable you to adjust the protocol. Much is known about the sequential design of experiments, but even in a more pragmatic setting it is important to be aware of your sources of variation as they occur and adjust for them." [@holmes2018modern]
"Analysis projects often begin with a simple script, perhaps to try out a few initial ideas and explore the quality of the pilot data. Then more ideas are added, more data come in, other datasets are integrated, more people become involved. Eventually the paper need to be written, the figures need to be done 'properly' and the analysis needs to be saved for the scientific record and to document its integrity." [@holmes2018modern]
"Use literate programming tools. Examples are Rmarkdown and Jupyter. This makes code more readable (for yourself and for others) than burying explanations and usage instructions in comments in the source code or in separate README files. In addition, you can directly embed figures and tables in these documents. Such documents are good starting points for the supplementary material of your paper. Moreover, they're great for reporting analyses to your collaborators." [@holmes2018modern]
One of the core tenets of programming is the philosophy of "Don't Repeat Yourself" (a.k.a. the "DRY principle").[Source of "Don't Repeat Yourself"---The Pragmatic Programmer] With programming, you can invest a little bit of time up front to program your computer to do things that would otherwise take a lot of your time. In this way, you can automate repetitive tasks.
"The DRY principle, for Don't Repeat Yourself, is one of the colloquial tenets of programming. That is, you should name things once, do things once, create a function once, and let the computer repeat itself." [ford2015code]
"Code, in other words, is really good at making things scale. Computers may require utterly precise instructions, but if you get the instructions right, the machine will tirelessly do what you command over and over and over again, for users around the world. ... Solve a problem once, and you've solved it for everyone." [Coders, p. 20]
"Since they have, at their beck and call, machines that can repeat instructions with robotic perfection, coders take a dim view of doing things repetitively themselves. They have a dislike of inefficiency that is almost aesthetic---they recoil from it as if from a disgusting smell. Any opportunity they have to automate a process, to do something more efficiently, they will." [Coders, p. 20]
"Programmers are obsessed with efficiency. ... Removing the friction from a system is an aesthetic joy; [programmers'] eyes blaze when they talk about making something run faster, or how they eliminated some bothersome human effort from a process." [Coders, p. 122]
"Computers, in many ways, inspire dreams of efficiency greater than any tool that came before. That's because they're remarkably good at automating repetitive tasks. Write a script once, set it running, and the computer will tirelessly execute it until it dies or the power runs out. What's more, computers are strong in precisely the ways that humans are weak. Give us a repetitive task, and our mind tends to wander, so we gradually perform it more and more irregularly. Ask us to do something at a precise time or interval, and we space out and forget to do it. ... In contrast, computers are clock driven and superb at doing the same thing at the same time, day in and day out." [Coders, p. 124]
"Larry Wall, the famous coder and linguist who created the Perl programming language, deeply intuited this coderly aversion to repetition. In his book on Perl, he and coauthors wrote that one of the key virtues of a programmer is 'laziness'. It's not that you're too lazy for coding. It's that you're too lazy to do routine things, so it inspires you to automate them." [Coders p. 126]
In scientific research, there are a lot of these repetitive tasks, and as tools for automation continue to develop, there are many opportunities to "automate away" busywork.
"Science often involves repetition of computational tasks such as processing large number of data files in the same way or regenerating figures each time new data are added to an existing analysis. Computers were invented to do these kinds of repetitive tasks but, even today, many scientists type the same commands in over and over again or click the same buttons repeatedly." [wilson2014best]
"Whenever possible, rely on the execution of programs instead of manual procedures to modify data. Such manual procedures are not only inefficient and error-prone, they are also difficult to reproduce." [sandve2013ten]
"Other manual operations like the use of copy and paste between documents should also be avoided. If manual operations cannot be avoided, you should as a minimum note down which data fiels were modified or moved, and for what purpose." [sandve2013ten]
Statisticians have been doing this for a while for data cleaning and analysis tasks. For example, if you need to read an Excel file into a statistical programming language like R, you could write a few lines of code to do that anew each time you get a new file. However, say you get Excel files over and over that follow the same format---for example, files with the same number of columns, the same names for those columns, and the same types of data. You can write a script---in this case, a saved file with a few lines of code---that reads in the file. You can then apply this script to each new file.
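A minimal sketch of such a script might look like the following, assuming the readxl and dplyr packages; the file paths and the idea of an "assay" file are entirely hypothetical:

```r
# Read one of the regularly formatted Excel files and tidy up its column names.
library(readxl)
library(dplyr)

read_assay_file <- function(path) {
  read_excel(path, sheet = 1) %>%
    rename_with(tolower) %>%              # enforce consistent, lowercase column names
    mutate(source_file = basename(path))  # record which file each row came from
}

# The same one-liner then works for every new file that follows the format.
assay_jan <- read_assay_file("data/assay_2019-01.xlsx")
assay_feb <- read_assay_file("data/assay_2019-02.xlsx")
```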
This saves you a little bit of time. It also ensures that you do the exact same thing with every file you get. And it means that, in the future, you can reproduce exactly what you do to a file today. Say, for example, that you are working on a project and you read in a file and conduct an analysis. Your laboratory group sends the paper out for review. Months later, you get back comments from the reviewers, and they are wondering what would happen if you had analyzed the data a bit differently---say, used a different statistical test. If you use a script to read in the data file, then when you re-run it to address the reviewers' comments, you can be sure that you are getting your data into the statistical program in exactly the same way you did months ago, so you're not unintentionally introducing differences in your results by doing small things differently when processing the file.
This idea can extend across the full data analysis you do on a project. You are only saving a little bit of time and effort, maybe, by automating the step where you read the data from a spreadsheet into the statistical program. And it takes some time to write that script the first time, so it can be tempting to do the work fresh each time instead. However, you can also write scripts that automate cleaning your data. Maybe you want to identify data points with very high (maybe suspect) values for a certain measurement, or remove observations with missing data. You can also write scripts that automate processing your data---doing things like calculating the time since the start of an experiment based on the recorded sampling time for an observation. Each of these steps might be small, but the time saved really adds up, since you typically need to perform many of these steps each time you run a new experiment.
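A sketch of what such a cleaning script might contain, with hypothetical column names and an arbitrary cutoff value:

```r
# Flag suspect values, drop missing observations, and compute time since the
# start of the experiment. Column names and the cutoff are made-up examples.
library(dplyr)

clean_observations <- function(raw) {
  raw %>%
    filter(!is.na(measurement)) %>%            # remove observations with missing data
    mutate(
      flag_high         = measurement > 1000,  # mark very high (maybe suspect) values
      hours_since_start = as.numeric(difftime(sample_time,
                                               min(sample_time),
                                               units = "hours"))
    )
}
```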
There are many cases in life where you'll need to make the choice between spending some time upfront to make something more efficient, versus doing it more quickly the first time but then having to do it "from scratch" again each following time. For example, say that you're teaching a class, and you need to take attendance for each class period. You could write down the names of each student at the first class and save that, and then the next class write down the name of each student who shows up that day on a separate sheet of paper, and so on for each class meeting. Conversely, you could take some extra time before the first class and create a table or spreadsheet file with every student's name and the date of each class, and then use that to mark attendance. The first method will be quicker the first day, but more time consuming each following time. The second method requires a small initial investment, but with time-saving returns in the following class meetings.
For people who use scripts and computer programs to automate their data-related tasks, it quickly becomes hard to understand how anyone could argue that they don't automate because they don't have time to learn how. If you're working with data, the time you save on your initial investment is so substantial that learning these tools would have to take an enormous amount of time not to be worth it. Plus---the thrill of running something that you've automated! It's very similar to the feeling you get when a student or postdoc you've spent a lot of time training reaches the point where you can simply ask them to run something, and they do, and it means you don't have to.
Here are some of the problems that are solved by automating your small tasks:
It gets done the same way every single time. Even simple tasks can be done with numerous small modifications. You will probably remember some of those choices, settings, and modifications the next time you need to do the same thing, but probably not all of them, so the process will not be exactly the same from one time to the next. If the computer is doing it based on a clear set of instructions, it will be.
It gets done more quickly. Or, if not more quickly (some large datasets might take a while to process), at least the time spent is the computer's time, not yours. You can leave the computer to run the script while you get on with other things that a computer can't do.
Anyone who does it can do it the same way. Just as you might not do something exactly the same way from one time to the next, one person in a laboratory group is likely to do things at least slightly differently than other members of the group. Even very detailed instructions written for humans are rarely detailed and precise enough to ensure that something is done exactly the same way by everyone who follows them. If everyone is given the same computer script to run, however, and they all instruct the computer to run that script, the task will be done in exactly the same way.
It is easier to teach new people how to do the task. Often, with a script to automate a task, you just need to teach someone new to the laboratory group how to get the computer to run a script in a certain language. When you need them to run a new script, the process will be the same. The script encapsulates all of the task-specific details, so the user doesn't need to understand all of them to get something to run. What's more, once you want to teach a new lab member how everything is working, so they can understand the full process, the script provides the exact recipe. You can teach them how to read and understand scripts in that language, and then the scripts you've created to automate tasks serve as a recipe book for everything going on in terms of data analysis for the lab.
You can create tools to share with others. If you've written a script that's very useful, with a bit more work you can turn it into a tool that you can share with other research groups and perhaps publish a paper about. Papers about R software extensions (also called packages) and data analysis workflows and pipelines are becoming more and more common in biological contexts.
It's more likely to be done correctly. Boring, repetitive tasks are easy to mess up. We get so bored with them that we shift our brains into a less attentive gear when we're working on them. This can lead to small, silly mistakes---mistakes at the level of typos but that, with data cleaning and analysis, can have much more serious ramifications.
"We view workflows as a paradigm to: 1) expose non-experts to well-understood end-to-end data analysis processes that have proven successful in challenging domains and represent the state-of-the-art, and 2) allow non-experts to easily experiment with different combinations of data analysis processes, represented as workflows of computations that they can easily reconfigure and that the underlying system can easily manage and execute." [hauder2011making]
"While reuse [of workflows] by other expert scientists saves them time and effort, reuse by non-experts is an enabling matter as in practice they would not be able to carry out the analytical tasks without the help of workflows." [hauder2011making]
"We observed that often steps that could be easily automated were performed manually in an error-prone fashion." [vidger2008supporting]
Biological research is quickly moving from a field where projects often required only simple and straightforward data analysis once the experimental data were collected---with the raw data often published directly in a table in the manuscript---to a field with very complex and lengthy data analysis pipelines between the experiment and the final manuscript. To ensure rigor and clarity in the final research results, as well as to allow others to reproduce the results exactly, the researcher must document all details of the computational data analysis, and this documentation is often missing from papers. RMarkdown documents (and their analogues) can provide all of these details unambiguously---with RMarkdown documents, you can even run a command to pull out all of the code used within the document, if you'd like to submit that code as a stand-alone script in a supplement to a manuscript.
"More recently, scientists who are not themselves computational experts are conducting data analysis with a wide range of modular software tools and packages. Users may often combine these tools in unusual or nove ways. In biology, scientists are now routinely able to acquire and explore data sets far beyond the scope of manual analysis, including billions of DNA bases, millions of genotypes, and hundreds of thousands of RNA measurements. ... While propelling enormous progress, this increasing and sometimes 'indirect' use of computation poses new challenges for scientific publication and replication. Large datasets are often analyzed many times, with modifications to the methods and parameters, and sometimes even updates of the data, until the final results are produced. The resulting publication often gives only scant attendtion to the computations details. Some papers have suggested these papers are 'merely the advertisement of scholarship whereas computer programs, input data, parameter values, etc., embody the scholarship itself.' However, the actual code or software 'mashup' that gave rise to the final analysis may be lost or unrecoverable." [mesirov2010accessible]
"Bioinformatic analyses invariably involve shepherding files through a series of transformations, called a pipeline or workflow. Typically, these transformations are done by third-part executable command line software written for Unix-compatible operating systems. The advent of next-generation sequencing (NGS), in which millions of short DNA sequences are used as the source input for interpreting a range of biological phenomena, has intensified the need for robust pipelines. NGS analyses tend to involve steps such as sequence alignment and genomic annotation that are both time-intensive and parameter-heavy." [leipzig2017review]
"Rule 7: Always Store Raw Data behind Plots. From the time a figure is first generated to it being part of a published article, it is often modified several times. In some cases, such modifications are merely visual adjustments to improve readability, or to ensure visual consistency between figures. If raw data behind figures are stored in a systematic manner, so as to allow raw data for a given figure to be easily retrieved, one can simply modify the plotting procedure, instead of having to redo the whole analysis. An additional advantage of this is that if one really wants to read fine values in a figure, one can consult the raw numbers. ... When plotting is performed using a command-based system like R, it is convenient to also store the code used to make the plot. One can then apply slight modifications to these commands, instead of having to specify the plot from scratch." [sandve2013ten]
Statisticians and data analysts have long automated data cleaning, processing, and analysis tasks. Until a few years ago, though, that still left the paper and report writing to be done by hand. This process is often repetitive. You would do your analysis and create some tables or figures. You would save these from your statistical program and then paste them into your report or paper draft. If you decided that you needed to change your analysis a bit, or if you got a new set of data to analyze in a similar way, you had to go back to the statistical program, run things again there, save the tables and figure files again, and paste them into the report or paper again to replace the outdated versions. If there were numbers from the analysis in the text of the paper, then you had to go back through the text and update all of those with the newer numbers, too.
Do you still write your papers and reports like this? I can tell you that there is now a much better way. Computer scientists and other programmers started thinking quite a while ago about how to create documents that combine computer code and text for humans, and to do it in a way where the computer code isn't just a static copy of what someone once told the computer to do, but instead a living, working, executable set of instructions that the computer can run anytime you ask it to.
These ideas first percolated with Donald Knuth, who many consider to be the greatest computer programmer of all time [Bill Gates, for example, has told anyone who reads Dr. Knuth's magnum opus, The Art of Computer Programming, to come see him right away about a job]. As Dr. Knuth was writing a book on computer programming, he became frustrated with the quality of the typesetting used in the final book. In a field that requires a lot of mathematical and other symbols incorporated into the text, it takes a bit more to make an attractive book than with simpler text. Dr. Knuth therefore took some time to create a program for typesetting. (You may have heard of it---if you ever notice that a journal's Instructions to Authors allow authors to submit articles in "LaTeX" or "TeX", that's using a system built off of Donald Knuth's typesetting program.)
And then, once he had that typesetting program, he started thinking about how programmers document their code. When one person does a very small code project, and that one person is the only person who will ever go back to try to modify or understand the code, that person might be able to get away with poor documentation in the code. However, interesting code projects can become enormous, with many collaborators, and it becomes impossible to understand and improve the code if it doesn't include documentation explaining, in human terms, what the code is doing at each step, as well as some overall documentation explaining how different pieces of the code coordinate to get something big done.
Traditionally, code was documented by including small comments within the code. These comments are located near the code that they explain, and the order of the information in the code files is therefore dominated by the order of the instructions to the computer, not the order in which you might explain what's going on to a human. To "read" the code and the documentation, you end up hopscotching through the code---following the code inside one function when it calls another function, for example, to where that second function is defined, and then back to the first, and so on. You often follow paths deeper and deeper into helper functions, and the helper functions of those functions, until you feel like you're searching through a set of nested Russian dolls, only to come back up and start on a new set of Russian dolls later down the line.
Donald Knuth realized that, with a good typesetting program that could itself be programmed, you could write your code so that the documentation for humans took precedence, and could be presented in a very clear and attractive final document, rather than hard-to-read computer code with some plain-text comments sprinkled in. Computers don't care what order the code is recorded in---as long as you give them some instructions on how to decipher code in a certain format or order, they can figure out how to use it fine. But human brains are a bit more finicky, and we need clear communication, laid out in a logical and helpful order. Donald Knuth created a paradigm of literate programming that interleaved executable code inside explanations written for humans; by making the code executable, it meant that the document was a living guide. When someone changed the program, they did it by changing the documentation---documentation wasn't left as the final, often neglected, step to refine once the "real code" was written (and the "real work" done).
"Programs must be written for people to read, and only incidentally for machines to execute. A great program is a letter from current you to future you or the the person who inherits your code. A generous humanistic document." [ford2015what]
Well, this was a fantastic idea. It hasn't been universally adopted, but the projects that do leverage it are much stronger for it. But that's not where the story ends. If you are someone who does a little bit of coding (maybe small scripts to analyze and visualize your data, for example) and a lot of "documenting" of the results, and if you're not planning on doing a lot of large coding projects or creating software tools, it's not immediately clear how you'd use these literate programming ideas.
Well, there are many people who do a little bit of programming in service to a larger research project. While they are not creating software that needs classical software documentation, they do want to document the results that they get when they run their scripts, and they want to create reports and journal articles to share what they've found. Several people took the ideas behind literate programming---as it's used to document large software projects---and leveraged it to create tools to automate writing in data-related fields.
Friedrich Leisch was the first to do this with the R programming language, with a tool called "Sweave" ("S"-"weave", since R builds off of another programming language called "S", and Leisch's program would "weave" together S / R code and writing). This system used Donald Knuth's typesetting program. It allowed you to write a document for humans (like a report or journal article) and to intersperse bits of code throughout the paper. You'd put each code piece in the spot in the paper where the text described what was going on or where you wanted the results it generated---for example, if you had a section in the Methods where you talked about removing observations that were outliers, you would add the code that took out those outliers right there in the paper. And if you had a place in the Results that talked about how your data differed between two experimental groups, you would add the code that generated the plot showing that difference right there in the paper.
To tell the computer how to distinguish between code and writing, you would add a slightly weird combination of characters each time you wanted to "switch" into code, and another each time you wanted to switch back to writing for humans. (These combinations were deliberately weird, because that practically guaranteed they weren't combinations you would ever want to type otherwise, so the computer would rarely get confused about whether a combination meant to switch to code or was just something that came up in the regular writing.) Once you had the document written up, you'd send it, code and writing and all, through R. R would ignore everything that wasn't code. When it got to the code pieces, it would run them, and if the code created output (like a figure or table), it would "write" that into the document at that point in the text. Then you'd run the document through Donald Knuth's typesetting program (or an extension of it), and the whole document would be typeset into an attractive final product (often a PDF, although you had some choices on the type of output).
This meant that you got very attractive final documents. It also meant that your data analysis code was well documented---it was "documented" by the very article or report you wrote based on it, because the code was embedded right there in the final product! It also meant that you could save a lot of time if you needed to go back and change some of your code later (or input a different or modified dataset). You just had to change that small piece of code or data input, and then essentially press a button to put everything together again, and the computer would re-write the whole report for you, with every figure and table updated. It even let you write small bits of computer code directly into the written text, in case you needed to write something like "this study included 52 subjects", where the "52" came from counting up the number of rows in one of your datasets---if you later added three more subjects and re-ran the analysis with the updated dataset, the report would automatically change to read "this study included 55 subjects".
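Using today's R Markdown syntax (introduced below) as an illustration, this kind of inline code looks roughly like the following, where `subjects` is a hypothetical data frame with one row per study subject:

```
This study included `r nrow(subjects)` subjects.
```

When the document is rendered, the backticked piece is replaced by the current number of rows in `subjects`, so the sentence updates itself whenever the underlying data change.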
Leisch's system is still out there, but another system that builds on it has been adopted much more widely. Yihui Xie started work on a program that tweaked and improved Leisch's Sweave program, creating something called "knitr" ("knit"-"R"---are you noticing a pattern in the names?). Xie's knitr program, along with its extensions, is now widely used for data analysis projects. What's more, it has grown to allow for larger or more diverse writing projects---this book, for example, is written using an extension called "bookdown", and extensions also exist for creating blogs that include executable R code ("blogdown") and websites with documentation for R packages ("pkgdown").
So now, let's put these two pieces together. We know that programmers love to automate small tasks, and we know that there are tools that can be used to "program" tasks that involve writing and reporting. So what does this mean if you frequently need to write reports that follow a similar pattern and start from similar types of data? If you are thinking like a coder, it means that you can move towards automating the writing of those reports.
One of us was once talking to someone who works in a data analysis-heavy field, and she described how much time she spends copying the figures that her team creates---based on a similar analysis of new data that's regularly generated---into PowerPoint presentations. So, for this week's report, she's creating a presentation that shows the same analysis she showed last week, just with newer data. Cutting and pasting like this is an enormous waste of time---there are tools to automate it.
First---think through the types of written reports or presentations you've created in the past year or two. Are there any that follow a similar pattern? Any that input the same types of data, but from different experiments, and then report the same types of statistics or plots for them? Are there Excel spreadsheets your lab uses that generate specific tables or plots that you often cut and paste for reports or presentations? Look through your computer file folders or email attachments if you need to---many of these might be small regular reports that are so regular that they don't pop right to mind. If you are creating documents that match any of these conditions, you probably have something ripe for converting to a reusable, automatable template.
"Think like Henry Ford; he saw that building cars was a repeatable process and came up with the moving assembly line method, revolutionizing production. You may not be building a physical product, but chances are you are producing something. ... Look for the steps that are nearly identical each time, so you can build your own assembly line." [rose2018dont]
... [Creating a framework for the report]
"Odds are, if you're doing any kind of programming, especially Web programming, you've adopted a framework. Whereas an SDK is an expression of a corporate philosophy, a framework is more like a product pitch. Want to save time? Tired of writing old code? Curious about the next new thing? You use a graphics framework to build graphical applications, a Web framework to build Web applications, a network framework to build network servers. There are hundreds of frameworks out there; just about every language has one. A popular Web framework is Django, which is used for coding in Python. Instagram was bootstrapped on it. When you sit down for the first time with Django, you run the command 'startproject', and it makes a directory with some files and configuration inside. This is your project directory. Now you have access to libraries and services that add to and enhance the standard library." [ford2015what]
One key advantage of creating a report template is that it optimizes the time of statistical collaborators. It is reasonable for a scientist with a couple of courses' worth of statistical training to design and choose the statistical tools for simple and straightforward data analysis. However, especially as the biological data collected in experiments expand in complexity and size, a statistician can recommend techniques and approaches to draw more knowledge from the data and to appropriately handle non-standard features of the data. There is substantial work involved in the design of any data analysis pipeline that goes beyond the very basics. It wastes time and resources to recreate this with each new project---time that, in the case of statistical collaborators, could probably be better spent extending the data analysis beyond the simplest possible approach, exploring new hypotheses, or adding exploratory analyses that could inform the design of future experiments.
"Workflows effectively capture valuable expertise, as they represent how an expert has designed computational steps and combined them into an end-to-end process." [hauder2011making]
When collaborative work between scientists and statisticians moves towards developing repeatable data analysis scripts and report templates, you will start to think more about common patterns and common questions that you ask across many experiments in your research program, rather than focusing only on the immediate needs of a specific project. You can start to identify the data analysis tools that are general purpose for your research lab, develop those into clean, well-running scripts or functions, and then start thinking about more sophisticated questions you want to ask of your data. The statisticians you collaborate with will be able to see patterns across your work and help to develop global, and perhaps novel, methods to apply within your research program, rather than piecemeal small solutions to small problems.
"Although foundational knowledge is taught in major universities and colleges, advanced data analytics can only be acquired through hands-on practical training. Only exposure to real-world datasets allows students to learn the importance of preparing and cleansing the data, designing appropriate features, and formulating the data mining task so that the data reveals phenomena of interest. However, the effort required to implement such complex multi-step data analysis systems and experiment with the tradeoffs of different algorithms and feature choices is daunting. For most practical domains, it can take weeks to months for a student to setup the basic infrastructure, and only those who have access to experts to point them to the right high-level design choices will endeavor on this type of learning. As a result, acquiring practical data analytics skills is out of reach for many students and professionals, posing severe limitations to our ability as a society to take advantage of our vast digital data resources." [hauder2011making]
"In practice designing an appropriate end-to-end process to prepare and analyze the data plays a much more influential role than using a novel classifier or statistical model." [hauder2011making]
It is neither quick nor simple to design the data analysis plan and framework for a research experiment. It is not simply naming a statistical test or two. Instead, the data analyst must start by making sure they understand the data, how it was measured, how to decipher the format in which it's stored, what questions the project is hoping to answer, where there might be problems in the data (and what they would look like), and so on. If a data analyst is helping with a lot of projects using similar types of data to answer similar questions, then he or she should, in theory, need less time for these "framework" types of questions and understanding. However, if data isn't shared in the same format each time, it will still take overhead to figure out that this is indeed the same type of data and that code from a previous project can be adapted or repurposed.
Let's think about one area where you likely repeat very similar steps frequently---writing up short reports or slide presentations to share your to-date research results with your research group or colleagues. These probably often follow a similar structure. For example, they may start with a section describing the experimental conditions, then have a slide showing a table with the raw data (or a simple summary of it, if there's a lot of data), and then have a figure showing something like the difference in experimental measurements between two experimental groups.
[Figure: Three simple slides for a research update---experimental conditions, table of raw data, boxplots with differences between groups.]
"The cornerstone of using DRY in your work life is the humble template. Whenever you create something, whether it's an email, a business document, or an infographic, think if there's something there you could save for future use. The time spend creating a template will save you exponentially more time down the road." [rose2018dont]
You could start very simply in turning this into a template. You could start by creating a PowerPoint document called "lab_report_template.pptx". It could include three slides, with the slide titles "Experimental conditions", "Raw data", and "Bacterial burden by group", and maybe with a slide theme set to provide the general formatting that you like (font, background color, etc.). That's it. When you need to write a new report, you copy this file, rename the copy, and open it up. Now, instead of needing to start from a blank PowerPoint file, you've shaved off those first few steps of setting up the pieces of the file you always use.
[Figure: Simplest possible template]
This very simple template won't save you much time---maybe just a minute or so for each report. However, once you identify other elements that you commonly use in that type of report, you can add more and more of these "common elements" to the template, so that you spend less time repeating yourself with each report. For example, say that you always report the raw data using the same number of columns and the same names for those columns. You could add a table to that slide in your template, with the columns set up with the appropriate column names. You can always add or delete rows in the table if you need to in your reports, but now each time you create a new report, you save yourself the time it takes to create the table structure and add the column names. Plus, you've now guaranteed that the first table will use the exact same column names every time you give a report! You'll never have to worry about someone wondering if you are using a different model animal because you have a column named "Animal ID" in one report, while your last report had "Mouse ID", for example. And because you're making a tool that you'll use many times, it becomes worthwhile to take some time double-checking the details, so you're more likely to avoid things like typos in the slide titles or in the column names of tables.
[Figure: Template with a table skeleton added.]
You can do the same thing for written reports or paper manuscripts. For example, most of the papers you write may have the classic scientific paper sections: "Introduction", "Data and Methods", "Results", and "Discussion". And then, you probably typically include a couple of pages at the beginning for the title page and abstract, and a section at the end with references and figure captions. Again, you could create a file called "article_template.docx" with section headings for each of these sections and with space for the title page, abstract, and references. Presumably, you are always an author on papers you're writing, so go ahead and add your name, contact information, and affiliation in the right place on the title page (I bet you have to take the time to do that every time you start a paper---and if you're like me, you have to look up the fax number for your building every time you do). You probably need to mention funding sources on the title page for every paper, too. Do you need to look those grant numbers up every time? Nope! Just put all of your current ones in the title page of your template, and then delete those that don't apply when you start a new paper.
[Figure: Simple article template]
Again, you can build on this simple template. Look through the "Data and Methods" section of several of your recent papers. Are there certain elements that you commonly report there? For example, is there a mouse model you use in most of your experiments, that you need to describe? Put it in the template. Again, you can always delete or modify this information if it doesn't apply to a specific paper. But for any information that you find yourself copying and pasting from one paper draft to another, add it to your template. It is so much more delightful to start work on a paper by deleting the details that don't apply than by staring down a blank sheet of paper.
[Quote---Taking away everything that isn't the statue.]
"Most docs you work on will have some sort of repeatable process. For example, when I sit down to write a blog post, I go through the same repeatable steps when setting up my file: Title, Subtitle, Focus Keywords, Links to relevant articles / inspiration, Outline of subheds, Intro / hook, etc. ... Even though it is a well-worn process, I can save time by creating a writing template with these sections already pre-set. Not only does this save time, but it also saves mental energy and helps push me into 'Writing' mode instead of 'Set-up' or 'Research' mode." [mackay2019dry]
This template idea is so basic, and yet far fewer people use it than would seem to make sense. Maybe that's because it does require some forward thinking about the elements of presentations, reports, and papers that are common across your body of work, not just the details that are pertinent to a specific project. It also requires some time investment, but not much more than adding all of these elements to a single paper or presentation takes. If you can see the appeal of having a template for the communication output that you create from your research, and if you try it and like it, then you are well on your way to having a programmer's mindset. The joy of programming is exactly this kind of joy---a little thinking and time at the start, and you have these little tools that do some of your work for you over and over again. In fact, a Python programmer has even written a book whose title captures this intrinsic esprit: "Automate the Boring Stuff with Python".
But wait. There's more. Do you always do the same calculations or statistical tests with the data you're getting in? Or at least often enough that it would save time to have a template? There is a way to add this into the template that you create for your presentation, report, or paper.
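As a preview of where this is heading, here is a rough sketch of what such a template might look like as an R Markdown file that renders to a PowerPoint presentation; the title, file name, column names, and chunk contents are all hypothetical placeholders:

````markdown
---
title: "Lab report: [experiment name]"
output: powerpoint_presentation
---

## Experimental conditions

(Describe the conditions for this experiment.)

## Raw data

```{r}
library(readr)
dat <- read_csv("experiment_data.csv", comment = "#")  # hypothetical file name
knitr::kable(dat)
```

## Bacterial burden by group

```{r}
library(ggplot2)
ggplot(dat, aes(x = dose_group, y = bacterial_burden)) +
  geom_boxplot()
```
````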
"Your templates are living documents. If you notice that you're making the same change over and over, that means it's time to update the template itself." [rose2018dont]
Researchers create and use Excel templates for this purpose. The template may have macros embedded in it to make calculations or create basic graphs. However, spreadsheets---whether created from templates or not---share the limitations discussed in an earlier chapter. What's more, they can't easily be worked into a template that creates a final document to communicate results, whether that's a slide presentation or a written document. Finally, they are in a binary format whose changes can't clearly be tracked with a version control tool like git.
[R Project templates? Can you create them? Clearly something like that is going on when you start a new package...]
Scientific workflows or pipelines have become very popular in many biological research areas. These are meant to meet many of the DRY goals---create a recipe that can be repeated at different times and by different research groups, clearly record each step of an analysis, and automate steps or processes that are repeated across different research projects so they can be completed more efficiently.
There are very sophisticated tools now available for creating biological data analysis pipelines and workflows,[leipzig2017review] including tools like Galaxy and Taverna. Simple code scripts and tools that build on them (like makefiles, RMarkdown documents, and Jupyter Notebooks), however, can be thought of as the simpler (and arguably much more customizable) little sibling of these more sophisticated tools.
"Scripts, written in Unix shell or other scripting languages such as Perl, can be seen as the most basic form of pipeline framework." [leipzig2017review]
"Naive methods such as shell scripts or batch files can be used to describe scientific workflows." [mishima2011agile]
Flexibility can be incorporated into scripts, and the tools that build directly off them, through including variables, which can be set in different configurations each time the script is run [leipzig2017review].
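A minimal sketch of this idea: put the values that change from run to run in clearly named variables at the top of the script, so each new run only requires editing those lines. The file path, column name, and cutoff below are hypothetical:

```r
# Settings that change from run to run live at the top of the script.
input_file    <- "data/assay_2019-02.csv"  # hypothetical path
burden_cutoff <- 1000                      # arbitrary example cutoff

library(readr)
library(dplyr)

# The rest of the script stays the same for every run.
assay <- read_csv(input_file) %>%
  filter(measurement < burden_cutoff)
```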
More complex pipeline systems do have some advantages (although generalizable tools that can be applied to scripts are quickly catching up on most of these). For example, many complex data analysis or processing steps may use open-source software that is under continuing development. If the creators of that software modify it between the time that you submit your first version of an article and the time that you need to submit revisions, and you have updated the version of the package on your computer, the code may no longer run the same way. The same thing can happen if someone else tries to run your code---if they are trying to run it with a more recent version of some of the open-source software used in the code, they may run into problems.
This problem of changes in dependencies of the code (software programs, packages, or extensions that the code loads and runs as part of its process) is an important challenge to reproducibility in many areas of science. Pipeline software can improve on simpler scripts by helping limit dependency problems [by ...]. However, R extensions are rapidly being developed that also address this issue. For example, the packrat package ...., while [packrat update Nichole was talking about].
"Dependencies refer to upstream files (or tasks) that downstream transformation steps require as input. When a dependency is updated, associated downstream files should be updated as well." [leipzig2017review]
The tools that we've discussed for reproducible and automatable report writing---like RMarkdown and Jupyter Notebooks---build off of a tool for coordinating and conducting a process involving multiple scripts and input files, or a "build tool". Among computer programmers, perhaps the most popular build tool is called "make". This tool allows coders to write a "Makefile" that details the order in which scripts should be run in a big process, and what other scripts and inputs each step requires. With these files, you can re-run a whole project in the right order, and the only steps that will be re-run are those where something will change based on whatever change you just made to the code or input data.
"To avoid errors and inefficiencies from repeating commands manually, we recommend that scientists use a build tool to automate workflows, e.g., specify the ways in which intermediate data files and final results depend on each other, and on the programs that create them, so that a single command will regenerate anything that needs to be regenerated." [wilson2014best]
For example, say that you have a large project that starts by inputting data, cleans or processes it using a step that takes a long time to run, analyzes the simpler processed data, and then creates some plots and tables based on this analysis. With a Makefile, if you want to change the color of the labels on a plot, you can change that code and re-run the Makefile, and the computer will re-make the plots, but not re-run the time-intensive early data processing steps. However, if you update the raw data for the project and re-run the Makefile, the computer will (correctly) run everything from the very beginning, since the updated data needs to be reprocessed, all the way through to creating the final plots and tables.
"A file containing commands for an interactive system is often called a script, though there is really no difference between this and a program. When these scripts are repeatedly used in the same way, or in combination, a workflow management tool can be used. The most widely used tool for this task is probably Make, although many alternatives are now available. All of these allow people to express the dependencies between files, i.e., to say that if A or B has changed, then C needs to be updated using a specific set of commands. These tools have been successfully adopted for scientific workflows as well." [wilson2014best]
"This experience motivated the creation of a way to encapsulate all aspects of our in silico analyses in a manner that would facilitate independent replication by another scientist. Computer and computational scientists refer to this goal as 'reproducible research', a coinage attributed to the geophysicist Jon Claerbout in 1990, who imposed the standard of makefiles for construction of all the filgures and computational results in papers published by the Stanford Exploration Project. Since that time, other approaches have been proposed, including the ability to insert active scripts within a text document and the use of a markup language that can produce all of the text, figures, code, algorithms, and settings used for the computational research. Although these approaches may accomplish the goal, they are not practical for many nonprogramming experimental scientists using other groups' or commercial software tools today." [mesirov2010accessible]
"All science campaigns of sufficient complexity consist of numerous interconnected computational tasks. A workflow in this context is the composition of several such computing tasks." [deelman2018future]
"Scientific applications can be very complex as software artifacts. They may contain a diverse amalgam of legacy codes, compute-intensive parallel codes, data conversion routines, and remote data extraction and preparation. These individual codes are often stitched together using scripted languages that specify the data and software to be executed, and orchestrate the allocation of computing resources and the movement of data across locations. To manage a particular set of codes, a number of interdependent scripts may be used." [gil2008data]
[Disadvantages of more complex pipeline tools over starting from scripts]
"Unlike command line-based pipeline frameworks ... workbenches allow end-users, typically scientists, to design analyses by linking preconfigured modular tools together, typically using a drag-and-drop graphical interface. Because they require exacting specifications of inputs and outputs, workbenches are intrinsically a subset of configuration-based pipelines." [leipzip2017review]
"Magnificent! Wonderful! So, what's the downside? Well, frameworks lock you into a way of thinking. You can look at a website and, with a trained eye, go, 'Oh, that's a Ruby on Rails site.' Frameworks have an obvious influence on the kind of work developers can do. Some people feel that frameworks make things too easy and that they become a crutch. It's pretty easy to code yourself into a hole, to find yourself trying to force the framework to do something it doesn't want to do. Django, for example, isn't the right tool for building a giant chat application, nor would you want to try competing with Google Docs using a Django backend. You pay a price in speed and control for all that convenience. The problem is really in knowing how much speed, control, and convenience you need." [ford2015what]
"Workbenches and class-based frameworks can be considered heavyweight. There are costs in terms of flexibility and ease of development associated with making a pipeline accessible or fast. Integrating new tools into workbenches clearly increases their audience but, ironically, the developers who are most capable of developing plug-ins for workbenches are the least likely to use them." [leipzip2017review]
"Business workflow management systems emerged in the 1990's and are well accepted in the business community. Scientific workflows differ from business workflows in that rather than coordinating activities between individuals and systems, scientific workflows coordinate data processing activities." [vigder2008supporting]
"The concept of workflows has traditionally been used in the areas of process modelling and coordination in industries. Now the concept is being applied to the computational process including the scientific domain." [mishima2011agile]
"Although bioinformatics-specific pipelinessuch as bcbio-nextgen and Omics Pipe offer high performance automated analysis, they are not frameworks in the sense they are not easily extensible to integrate new user-defined tools." [leipzig2017review]
Writing a script-based pipeline does require that you or someone in your laboratory group develop some expertise in writing code in a "scripting language" like R or Python. However, the barriers to entry for these languages continue to come down, and with tools that leverage the ideas of templating and literate programming, it is becoming easier and easier for new R or Python users to learn to use them quickly. For example, one of us teaches a three-credit R programming class designed for researchers who have never coded. By the end of the class, the students are regularly creating code projects that use literate programming tools to weave together code and text, saving these documents within project directories that also include raw data, processed data, and scripts that define commonly used pieces of code (saved as functions). These are all the skills you'd need to craft an R project template for your research group that can serve as a starting point for each future experiment or project.
"Without an easy-to-use graphical editor, developing workflows requires some programming knowledge." [vigder2008supporting]
"Scripting languages are programming languages and as a result are inaccessible to any scientists without computing background. Given that a major aspect of scientific research is the assembly of scientific processes, the fact that scientists cannot assemble or modify the applications themselves results in a significant bottleneck." [gil2008data]
"As anyone who's ever shared a networked folder---or organized a physical filing cabinent---knows, without a good shared filing system your office will implode." [ford2015code]
"You can tell how well code is organized from across the room. Or by squinting or zooming out. The shape of code from 20 feet away is incrediably informative. Clean code is idiomatic, as brief as possible, obvious even if it's not heavily documented. Colloquial and friendly." [ford2015code]
"[Wesley Clark] wanted to make the world's first 'personal computer', one that could fit in a single office or laboratory room. No more waiting in line; one scientist would have it all to himself (or, more rarely, herself). Clark wanted specifically to target biologists, since he knew they often needed to crunch data in the middle of an experiment. At that time, if they were using a huge IBM machine, they'd need to stop and wait their turn. If they had a personal computer in their own lab? They could do calculations on the fly, rejiggering their experiment as they went. It would even have its own keyboard and screen, so you could program more quickly: no clumsy punch cards or printouts. It would be a symbiosis of human and machine intelligence. Or, as Wilkes put it, you'd have 'conversational access' to the LINC: You type some code, you see the result quickly. Clark knew he and his team could design the hardware. But he needed Wilkes to help create the computers' operating system that would let the user control the hardware in real time. And it would have to be simple enough that biologists could pick it up with a day or two of training." [Coders, p. 32]
"When they had a rough first prototype [of the LINC] working, Clark tested it on a real-life problem of biological research. He and his colleague Charles Molnar dragged a LINC out to the lab of neurologist Arnold Starr, who had been trying and failing to record the neuroelectric signals cats produce in their brains when they heard a sound. Starr had put an electrode implant into a cat's cortex, but he couldn't distinguish the precise neuroelectirc signal he was looking for. In a few hours, Molnar wrote a program for the LINC that would play a clicking noise out of a speaker, record precisely when the electrode fired, and map on the LINC's screen the average response of the cat to noises. It worked: As data scrolled across the screen, the scientists 'danced a jig right around the equipment'." [Coders, p. 33]
If you have built a pipeline as an R or Python script, but there is an open-source software tool that you need to use that is written in another language, you can write a "wrapper" function that calls that software from within the R or Python process. And chances are good, if that software is a popular tool, that someone else has already written one, so you can just leverage that code or tool. Open-source scripting languages like R and Python "play well with others", and can communicate with and run just about anything that you could run at a command line.
"Our approach to dealing with software integration is to wrap applications with Python wrappers." [vigder2008supporting]
The templating process can eventually extend to making small tools as software functions and extensions. For example, if you regularly create a certain type of graph to show your results, you could write a small function in R that encapsulates the common code for creating it. One research group I know of wanted to make sure their figures all had a similar style (font, color for points, etc.), but didn't like the default values, and so wrote a small function that applied their style choices to every plot they made. Once your research group has a collection of these small functions, you can in turn encapsulate them in an R package (which is really just a collection of R functions, plus maybe some data and documentation). This package doesn't have to be shared outside your research group---you can share it internally, and then everyone can load and use it in their computational work. With the rise of larger datasets in many fields, and the accompanying need to do more and more work on the computer to clean, manage, and analyze the data, more scientists are adopting the mindset that they are not just the "end users" of software tools: they can dig in and become artisans of small tools themselves, building on the larger structure and heavier lifting provided by the base software.
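As a sketch of this kind of small, shareable tool, here is what a lab-wide plot styling function might look like with the ggplot2 package; the specific style choices are invented for illustration:

```r
library(ggplot2)

# A small helper that applies the lab's preferred plot styling;
# it could live in a shared script now and an internal package later
theme_ourlab <- function(base_size = 12) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold"),
      legend.position = "bottom"
    )
}

# Example use:
# ggplot(mtcars, aes(x = wt, y = mpg)) +
#   geom_point(color = "darkblue") +
#   theme_ourlab()
```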
"End-user software engineering refers to research dedicated to improving the capability of end-users who need to perform programming or engineering tasks. For many, if not all, of these end-users, the creation and maintenance of software is a secondary activity performed only in service of their real work. This scenario applies to many fields include science. However, there is little research specifically focused on scientists as end-user software engineers." [vigder2008supporting]
John Chambers (one of the creators of R's precursor S, and heavily involved in R deveopment) defines programming as "a language and environment to turn ideas into new tools." [Programming with Data, p. 2]
"Sometimes, it seems that the software we use just sort of sprang into existance, like grass growing on the lawn. But it didn't. It was created by someone who wrote out---in code---a long, painstaking set of instructions telling the computer precisely what to do, step-by-step, to get a job done. There's a sort of priestly class mystery cultivated around the word algorithm, but all they consist of are instructions: Do this, then do this, then do this. News Feed [in Facebook] is now an extraordinarily complicated algorithm involving some trained machine learning; but it's ultimately still just a list of rules." [Coders, p. 10]
"One of your nonnegotiable rules should be that every person in the lab must keep a clear and detailed laboratory notebook. The business of the lab is results and the communication of those results, and the lab notebook is the all-important documentation of each person's research. There are dozens of reasons to keep a clear and detailed lab notebook and only one---laziness---for not. Whether the work is on an esoteric branch of clam biology or is heading toward a potentially lucrative patent, it makes sense to keep data clear and retrievable for both present and future lab members." [@leips2010helm]
"Paper lab notebooks are most commonly seen, valued for their versatility, low expense, and ease of use. The paper notebook type may be determined by the department or institution. Especially at latge companies, there may also be a policy that dictates format, daily signatures by supervisors, and lock-up at night. If there is no requirement, everyone in your lab should use the same kind of bound lab notebook, as determined by you. It should have numbered pages, gridlines, and a tough enough binding that it does not fall apart after a few months of rigorous use on the bench." [@leips2010helm]
"Electronic lab notebooks (ELNs) may be used to enter, store, and analyze data. Coupled with a sturdy notebook computer, they can be used at the bench for notes and alterations to the protocol. Lab members and collaborators can share data and drawings. Reagents can be organized. Shareware versions are available as well as many stand-alone programs that can be tweaked for your needs by the company. A lab that generates high-throughput, automated, or visual data; collaborates with other labs; or has a high personnel turnover should consider using an ELN (Phillips 2006)." [@leips2010helm]
"Laboratory Information Management Systems (LIMSs) are programs that are coupled to a database as well as to lab equipment and facilitate the entry and storage of laboratory data. These systems are expensive and are designed for more large-scale testing and production than for basic research labs. They can be very useful tools: Some can manage protocols, schedule maintenance of lab instruments, receive and process data from multiple instruments, track reagents and samples, print sample labels, do statistical analyses, and be customized to your needs. But although LIMSs can handle a great deal of data analysis, they cannot yet substitute for a lab notebook." [@leips2010helm]
"You should demand a level of care with lab notebooks. Everything in it should be understandable not only to the owner, but to you." [@leips2010helm]
"Whether you check notebooks, or have a lab member present to you with raw data, or stop at everyone's lab bench a few times a week, you must have a feeling for the quality and results of each person's raw data. It is very easy to make assumptions on the basis of the polished data you see at a research meeting, but many a lab member has gone astray with over- or misinterpreted data. By keeping an eye on the raw data, you can be ready to comment on the number of repetitions, alternative experiments, or the implications of a minor result." [@leips2010helm]
"In general, data reuse is most possible when: 1) data; 2) metadata (information describing the data); and 3) information about the process of generating those data, such as code, are all provided." [@goodman2014ten]
"So far we have used filenames without ever saying what a legal name is, so it's time for a couple of rules. First, filenames are limited to 14 characters. Second, although you can use almost any character in a filename, common sense says you should stick to ones that are visible, and that you should avoid characters that might be used with other meanings. ... To avoid pitfalls, you would do well to use only letters, numbers, the period and the underscore until you're familiar with the situation [i.e., characters with pitfalls]. (The period and the underscore are conventionally used to divide filenames into chunks...) Finally, don't forget that case distinctions matter---junk, Junk, and JUNK are three different names." [@kernighan1984unix]
"The [Unix] system distinguishes your file called 'junk' from anyone else's of the same name. The distinction is made by grouping files into directories, rather in the way that books are placed om shelves in a library, so files in different directories can have the same name without any conflict. Generally, each user haas a personal or home directory, sometimes called login directory, that contains only the files that belong to him or her. When you log in, you are 'in' your home directory. You may change the directory you are working in---often called your working or current directory---but your home directory is always the same. Unless you take special action, when you create a new file it is made in your current directory. Since this is initially your home directory, the file is unrelated to a file of the same name that might exist in someone else's directory. A directory can contain other directories as well as ordinary files ... The natural way to picture this organization is as a tree of directories and files. It is possible to move around within this tree, and to find any file in the system by starting at the root of the tree and moving along the proper branches. Conversely, you can start where you are and move toward the root." [@kernighan1984unix]
"The name '/usr/you/junk' is called the pathname of the file. 'Pathname' has an intuitive meaning: it represents the full name of the path from the root through the tree of directories to a particular file. It is a universal rule in the Unix system that wherever you can use an ordinary filename, you can use a pathname." [@kernighan1984unix]
"If you work regularly with Mary on information in her directory, you can say 'I want to work on Mary's files instead of my own.' This is done by changing your current directory with the
cd
command... Now when you use a filename (without the /'s) as an argument tocat
orpr
, it refers to the file in Mary's directory. Changing directories doesn't affect any permissions associated with a file---if you couldn't access a file from your own directory, changing to another directory won't alter that fact." [@kernighan1984unix]"It is usually convenient to arrange your own files so that all the files related to one thing are in a directory separate from other projects. For example, if you want to write a book, you might want to keep all the text in a directory called 'book'." [@kernighan1984unix]
"Suppose you're typing a large document like a book. Logically this divides into many small pieces, like chapters and perhaps sections. Physically it should be divided too, because it is cumbersome to edit large files. Thus you should type the document as a number of files. You might have separate files for each chapter, called 'ch1', 'ch2', etc. ... With a systematic naming convention, you can tell at a glance where a particular file fits into the whole. What if you want to print the whole book? You could say
$ pr ch1.1 ch1.2 ch 1.3 ...
, but you would soon get bored typing filenames and start to make mistakes. This is where filename shorthand comes in. If you say$ pr ch*
the shell takes the*
to mean 'any string of characters,' so ch* is a pattern that matches all filenames in the current directory that begin with ch. The shell creates the list, in alphabetical order, and passes the list topr
. Thepr
command never sees the*
; the pattern match that the shell does in the current directory generates aa list of strings that are passed topr
." [@kernighan1984unix]"The current directory is an attribute of a process, not a person or a program. ... The notion of a current directory is certainly a notational convenience, because it can save a lot of typing, but its real purpose is organizational. Related files belong together in the same directory. '/usr' is often the top directory of a user file system... '/usr/you' is your login directory, your current directory when you first log in. ... Whenever you embark on a new project, or whenever you have a set of related files ... you could create a new directory with
mkdir
and put the files there." [@kernighan1984unix]"Despite their fundamental properties inside the kernel, directories sit in the file system as ordinary files. They can be read as ordinary files. But they can't be created or written as ordinary files---to preserve its sanity and the users' files, the kernel reserves to itself all control over the contents of directories." [@kernighan1984unix]
"A file has several components: a name, contents, and administrative information such as permissions and modifications times. The administrative information is stored in the inode (over the years, the hyphen fell out of 'i-node'), along with essential system data such as how long it is, where on the disc the contents of the file are stored, and so on. ... It is important to understand inodes, not only to appreciate the options on
ls
, but because in a strong sense the inodes are the files. All the directory hierarchy does is provide convenient names for files. The system's name for a file is its i-number: the number of the inode holding the file's information. ... It is the i-number that is stored in the first two bytes of a directory, before the name. ... The first two bytes in each directory entry are the only connection between the name of a file and its contents. A filename in a directory is therefore called a link, because it links a name in the directory hierarchy to the inode, and hence to the data. The same i-number can appear in more than one directory. Therm
command does not actually remove the inodes; it removes directory entries or links. Only when the last link to a file disappears does the system remove the inode, and hence the file itself. If the i-number in a directory entry is zero, it means that the link has been removed, but not necessarily the contents of the file---there may still be a link somewhere else." [@kernighan1984unix]
"The file system is the part of the operating system that makes physical storage media like disks, CDs and DVDs, removable memory devices, and other gadgets look like hierarchies of files and folders. The file system is a great example of the distinction between logical organization and physical implementation; file systems organize and store information on many differet kinds of devices, but the operating system presents the same interface for all of them." [@kernighan2011d]
" A folder contains the names of other folders and files; examining a folder will reveal more folders and files. (Unix systems traditionally use the word directory instead of folder.) The folders provide the organizational structure, while the files hold the actual contents of documents, pictures, music, spreadsheets, web pages, and so on. All the information that you computer holds is stored in the file system and is accessible through it if you poke around. This includes not only your data, but the executable forms of programs (a browser, for example), libraries, device drivers, and the files that make up the operating system itself. ... The file system manages all this information, making it accessible for reading and writing by applications and the rest of the operating system. It coordinates accesses so they are performed efficiently and don't interfere with each other, it keeps track of where data is physically located, and it ensures that the pieces are kept separate so that parts of your email don't mysteriously wind up in your spreadsheets or tax returns." [@kernighan2011d]
"File system services are available through system calls at the lowest level, usually supplemented by libraries to make common operations easy to program." [@kernighan2011d]
"The file system is a wonderful example of how a wide variety of physical systems can be made to present a uniform logical appearance, a hierarchy of folders and files." [@kernighan2011d]
"A folder is a file that contains information about where folders and files are located. Because information about file contents and organization must be perfectly accurate and consistent, the file system reserves to itself the right to manage and maintain the contents of folders. Users and application programs can only change the folder contents implicitly, by making requests of the file system." [@kernighan2011d]
"In fact, folders are files; there's no difference in how they are stored except that the file system is totally responsible for folder contents, and application programs have no direct way to change them. But otherwise, it's just blocks on the disk, all managed by the same mechanisms." [@kernighan2011d]
"A folder entry for this [example] file would contain its name, its size of 2,500 bytes, the date and time it was created or changed, and other miscellaneous facts about it (permissions, type, etc., depending on the operating system). All of that information is visible through a program like Explorer or Finder. The folder entry also contains information about where the file is stored on disk---which of the 100 million blocks [on the example computer's hard disk] contain its bytes. There are different ways to manage that location information. The folder entry could contain a list of block numbers; it could refer to a block that itself contains a list of block numbers; or it could contain the number of the first block, which in turn gives the second block, and so on. ... Blocks need not be physically adjacent on disk, and in fact they typically won't be, at least for large files. A megabyte file will occupy a thousand blocks, and those are likely to be scattered to some degree. The folders and the block lists are themselves stored in blocks..." [@kernighan2011d]
"When a program wants to access an existing file, the file system has to search for the file starting at the root of the file system hierarchy, looking for each component of the file path name in the corresponding folder. That is, if the file is
/Users/bwk/book/book.txt
on a Mac, the file system will search the root of the file system forUsers
, then search within that folder forbwk
, then within that folder forbook
, then within that forbook.txt
. ... This is a divide-and-conquer strategy, since each component of the path narrows the search to files and folders that lie within that folder; all others are eliminated. Thus multiple files can have the same name for some component; the only requirement is that the full path name be unique. In practice, programs and the operating system keep track of the folder that is currenlty in use so searches need not start from the root each time, and the system is likely to cache frequently-used folders to speed up operations." [@kernighan2011d]"When quitting R, the option is given to save the 'workspace image'. The workspace consists of all values that have been created during a session---all of the data values that have been stored in RAM. The workspace is saved as a file called
.Rdata
and then R starts up, it checks for such a file in the current working directory and loads it automatically. This provides a simple way of retaining the results of calculations from one R session to the next. However, saving the entire R workspace is not the recommended approach. It is better to save the original data set and R code and re-create results by running the code again." [@murrell2009introduction]"Project directory organization isn't just about being tidy, but is essential to the way by which tasks are automated across large numbers of files" [@buffalo2015bioinformatics]
"Naming files and directories on a computer matters more than you may think. In transitioning from a graphical user interface (GUI) based operating system to the Unix command line, many folks bring the bad habit of using spaces in file and directory names. This isn't appropriate in a Unix-based environment, because spaces are used to separate arguments in commands. ... Although Unix doesn't require file extensions, including extensions in file names helps indicate the type of each file. For example, a file named osativa-genes.fasta makes it clear that this is a file of sequences in FASTA format. In contrast, a file named osativa-genes could be a file of gene models, notes on where these Oryza sativa genes came from, or sequence data. When in doubt, explicit is always better than implicit when it comes to filenames, documentation, and writing code." [@buffalo2015bioinformatics]
"Scripts and analyses often need to refer to other files (such as data) in your project hierarchy. This may require referring to parent directories in you directory's hierarcy ... In these cases, it's important to always use relative paths ... rather than absolute paths ... As long as your internal project directory structure remains the same, these relative paths will always work. In contrast, absolute paths rely on you particular user account and directory structures details above the project directory level (not good). Using absolute paths leaves your work less portable between collaborators and decreases reproducibility." [@buffalo2015bioinformatics]
"Document the origin of all data in your project directory. You need to keep track of where data was downloaded from, who gave it to you, and any other relevant information. 'Data' doesn't just refer to your project's experimental data---it's any data that programs use to create output. This includes files your collaborators send you from their separate analyses, gene annotation tracks, reference genomes, and so on. It's critical to record this important data about you're data, or metadata. For example, if you downloaded a set of genic regions, record the website's URL. This seems like an obvious recommendation, but ocuntless times I've encountered an analysis step that couldn't be easily reproduced because someone forgot to record the data's source." [@buffalo2015bioinformatics]
"Record data version information. Many databases have explicit release numbers, version numbers, or names (e.g., TAIR10 version of genome annotation for Arabidopsis thaliana, or Wormbase release WS231 for Caenorhabditis elegans). It's important to record all version information in your documentation, including minor version numbers." [@buffalo2015bioinformatics]
"Describe how you downloaded the data. For example, did you use MySQL to download a set of genes? Or the USCS Genome Browser? THese details can be useful in tracking down issues like when data is different between collaborators." [@buffalo2015bioinformatics]
"Bioinformatics projects involve many subprojects and subanalyses. For example, the quality of raw experimental data should be assessed and poor quality regions removed before running it through bioinformatics tools like aligners or assemblers. ... Even before you get to actually analyzing the sequences, your project directory can get cluttered with intermediate files. Creating directories to logically separate subprojects (e.g., sequencing data quality improvement, aligning, analyzing alignment results, etc.) can simplify complex projects and help keep files organized. It also helps reduce the risk of accidentally clobbering a file with a buggy script, as subdirectories help isolate mishaps. Breaking a project down into subprojects and keeping these in separate subdirectories also makes documenting your work easier; each README pertains to the directory it resides in. Ultimately, you'll arrive at your own project organization system that works for you; the take-home point is: leverage directories to help stay organized." [@buffalo2015bioinformatics]
"Because lots of daily bioinformatics work involves file processing, programmatically accessing files makes our job easier and eliminates mistakes from mistyping a filename or forgetting a sample. However, our ability to programmatically access files with wildcards (or other methods in R or Python) is only possible when our filenames are consistent. While wildcards are powerful, they're useless if files are inconsistently named. ... Unfortunately, inconsistent naming is widespread across biology, and is the source of bioinformaticians everywhere. Collectively, bioinformaticians have probably wasted thousands of hours fighting others' poor naming schemes of files, genes, and in code." [@buffalo2015bioinformatics]
"Another useful trick is to use leading zeros ... when naming files. This is useful because lexicographically sorting files (as
ls
does) leads to correct ordering. ... Using leading zeros isn't just useful when naming filenames; this is also the best way to name genes, transcripts, and so on. Projects like Ensembl use this naming scheme in naming their genes (e.g., ENSG00000164256)." [@buffalo2015bioinformatics]"In order to read or write a file, the first thing we need to be able to do is specify which file we want to work with. Any function that works with a file requires a precise description of the name of the file and the location of the file. A filename is just a character value..., but identifying the location of a file can involve a path, which describes a location on a persistent storage medium, such as a hard drive." [@murrell2009introduction]
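A small base-R sketch of both ideas from the quotes above (zero-padded names and programmatically built file paths), using made-up sample and directory names:

```r
# Zero-padded identifiers sort correctly when listed lexicographically
sample_ids <- sprintf("sample_%03d", 1:12)
head(sample_ids, 3)
#> [1] "sample_001" "sample_002" "sample_003"

# Build file paths programmatically rather than typing each one by hand
data_files <- file.path("data", "raw", paste0(sample_ids, ".csv"))
```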
"A regular expression consists of a mixture of literal characters, which have their normal meaning, and metacharacters, which have a special meaning. The combination describes a pattern that can be used to find matches amongst text values." [@murrell2009introduction]
"A regular expression may be as simple as a literal word, such as
cat
, but regular expressions can also be quite complex and express sophisticated ideas, such as[a-z]{3,4}[0-9]{3}
, which describes a pattern consisting of either three or four lowercase letters followed by any three digits." [@murrell2009introduction]"... it's important to mind R's working directory. Scripts should not use
setwd()
to set their working directory, as this is not portable to other systems (which won't have the same directory structure). For the same reason, use relative paths ... when loading in data, and not absolute pathers... Also, it's a good idea to indicate (either in comments or a README file) which directory the user should set as their working directory." [@buffalo2015bioinformatics]"Centralize the location of the raw data files and automate the derivation of intermediate data. Store the input data on a centralized file server that is profesionally backed up. Mark the files as read-only. Have a clear and linear workflow for computing the derived data (e.g., normalized, summarized, transformed, etc.) from the raw files, and store these in a separate directory. Anticipate that this workflow will need to be run several times, and version it. Use the
BiocFileCache
package to mirror these files on your personal computer. [footnote: A more basic alternative is the rsync utility. A popular solution offered by some organizations is based on ownCloud. Commercial options are Dropbox, Google Drive and the like]." [@holmes2018modern]"Using an RCS [revision control system] has changed how I work. ... a day's work is no longer a featureless slog toward the summit, but a sequence of small steps. What one feature could I add? What one problem could I fix? Once a step is made and you are sure your code base is in a safe and clean state, commit a revision, and if your next step turns out disastrously, you can fall back to the revision you just committed instead of starting from the beginning." [@klemens201421st]
With version control, "Our filesystem now has a time dimension. We can query the RCS's repository of file information to see what a file looked like last week and how it changed from then to now. Even without the other powers, I have found that this alone makes me a more confident writer." [@klemens201421st]
"The most rudimentary means of revision control is via
diff
andpatch
, which are POSIX-standard and therefore most certainly on your system." [@klemens201421st]"Git is a C program like any other, and is based on a small set of objects. The key object is the commit object, which is akin to a unified diff file. Given a previous commit object and some changes from that baseline, a new commit object encapsulates the information. It gets some support from the index, which is a list of the changes registered since the last commit object, the primary use of which will be in generating the next commit object. The commit objects link together to form a tree much like any other tree. Each commit object will have (at least) one parent commit object. Stepping up and down the tree is akin to using
patch
andpatch -R
to step among versions." [@klemens201421st]"Having a backup system organized enough that you can delete code with confidence and recover as needed will already make you a better writer." [@klemens201421st]
"GitHub issues are a great way to keep track of bugs, tasks, feature requests, and enhancements. While classical issue trackers are primarily intended to be used as bug trackers, in contrast, GitHub issue trackers follow a different philosophy: each tracker has its own section in every repository and can be used to trace bugs, new ideas, and enhancements by using a powerful tagging system. The main objective of issues in GitHub is promoting collaboration and providing context using cross-references. Raising an issue does not require lengthy forms to be completed. It only requires a title and, preferably, at least a short description. Issues have very clear formatting and provide space for anyone with a GitHub account to provide feedback. ... Additional elements of issues are (i) color-coded labels that help to categorize and filter issues, (ii) milestones, and (iii) one assignee responsible for working on the issue." [@perez2016ten]
"As another illustration of issues and their generic and wide application, we and others used GitHub issues to discuss and comment on changes in manuscripts and address reviewers' comments." [@perez2016ten]
"A good approach is to store at least three copies in at least two geographically distributed locations (e.g., original location such as a desktop computer, an external hard drive, and one or more remote sites) and to adopt a regular schedule for duplicating the data (i.e., backup)." [@michener2015ten]
"One study surveyed neuroscience researchers at a UK institute. "The backup 'rule of three' states that for a file to be sufficiently backed up it should be kept in three separate locations using two different types of media with one offsite backup. A lack of an adequate backup solution could mean permanently lost data, effort and time. In this research, more than 82% of the respondents seemed to be unaware of suitable backup procedures to protect their data. Some respondents kept a single backup of work on external hard disks. Others used the Universities local networked servers as their means of backup." [@altarawneh2017pilot]
"Departmental or institutional servers provide an area to store large files such as graphics files as well as e-mail and documents. Such systems will usually have frequent routine backups of all data, often onto optical disks. They might also encrypt the data, which makes it less able to be hacked. This is the most dependable form of long-term storage." [@leips2010helm]
"It's very important to keep a project notebook containing detailed information about the chronology of your computational work, steps you've taken, information about why you've made decisions, and of course all pertinent information to reproduce your work. Some scientists do this in a handwritten notebook, others in Microsoft Word documents. As with README files, bioinformaticians usually like keeping project notebooks in simple plain-text because these can be read, searched, and edited from the command line and across network connections to servers. Plain text is also a future-proof format: plain-text files written in the 1960s are still readable today, whereas files from word processors only 10 years old can be difficult or impossible to open and edit. Additionally, plain text project notebooks can also be put under version control ... While plain-text is easy to write in your text editor, it can be inconvenient for collaborators unfamiliar with the command line to read. A lightweight markup language called Markdown is a plain-text format that is easy to read and painlessly incorporated into typed notes, and can also be rendered to HTML or PDF." [@buffalo2015bioinformatics]
"Markdown is just plain-text, which means that it's portable and programs to edit and read it will exist. Anyone who's written notes or papers in old versions of word processors is likely familiar with the hassle of trying to share or update out-of-date proprietary formats. For these reasons, Markdown makes for a simple and elegant notebook format." [@buffalo2015bioinformatics]
"Information, whether data or computer code, should be organized in such a way that there is only one copy of each important unit of information." [@murrell2009introduction]
"A typical encounter with Bioconductor (Box 1) starts with a specific scientific need, for example, differential analysis of gene expression from an RNA-seq experiment. The user identifies the appropriate documented workflow, and because the workflow contains functioning code, the user runs a simple command to install the required packages and replicate the analysis locally. From there, she proceeds to adapt the workflow to her particular problem. To this end, additional documentation is available in the form of package vignettes and manual pages." [@huber2015orchestrating]
"Case study: high-throughput sequencing data analysis. Analysis of large-scale RNA or DNA sequencing data often begins with aligning reads to a reference genome, which is followed by interpretation of the alignment patterns. Alignment is handled by a variety of tools, whose output typically is delivered as a BAM file. The Bioconductor packages Rsamtools and GenomicAlignments provide a flexible interface for importing and manipulating the data in a BAM file, for instance for quality assessment, visualization, event detection and summarization. The regions of interest in such analyses are genes, transcripts, enhancers or many other types of sequence intervals that can be identified by their genomic coordinates. Bioconductor supports representation and analysis of genomic intervals with a 'Ranges' infrastructure that encompasses data structures, algorithms and utilities including arithmetic functions, set operations and summarization (Fig. 1). It consists of several packages including IRanges, GenomicRanges, GenomicAlignments, GenomicFeatures, VariantAnnotation and rtracklayer. The packages are frequently updated for functionality, performance and usability. The Ranges infrastructure was designed to provide tools that are convenient for end users analyzing data while retaining flexibility to serve as a foundation for the development of more complex and specialized software. We have formalized the data structures to the point that they enable interoperability, but we have also made them adaptable to specific use cases by allowing additional, less formalized userdefined data components such as application-defined annotation. Workflows can differ vastly depending on the specific goals of the investigation, but a common pattern is reduction of the data to a defined set of ranges in terms of quantitative and qualitative summaries of the alignments at each of the sites. Examples include detecting coverage peaks or concentrations in chromatin immunoprecipitation–sequencing, counting the number of cDNA fragments that match each transcript or exon (RNA-seq) and calling DNA sequence variants (DNA-seq). Such summaries can be stored in an instance of the class GenomicRanges." [@huber2015orchestrating]
"Visualization is essential to genomic data analysis. We distinguish among three main scenarios, each having different requirements. The first is rapid interactive data exploration in 'discovery mode.' The second is the recording, reporting and discussion of initial results among research collaborators, often done via web pages with interlinked plots and tool-tips providing interactive functionality. Scripts are often provided alongside to document what was done. The third is graphics for scientific publications and presentations that show essential messages in intuitive and attractive forms. The R environment offers powerful support for all these flavors of visualization—using either the various R graphics devices or HTML5-based visualization interfaces that offer more interactivity---and Bioconductor fully exploits these facilities. Visualization in practice often requires that users perform computations on the data, for instance, data transformation and filtering, summarization and dimension reduction, or fitting of a statistical model. The needed expressivity is not always easy to achieve in a point-and-click interface but is readily realized in a high-level programming language. Moreover, many visualizations, such as heat maps or principal component analysis plots, are linked to mathematical and statistical models---for which access to a scientific computing library is needed." [@huber2015orchestrating]
" It can be surprisingly difficult to retrace the computational steps performed in a genomics research project. One of the goals of Bioconductor is to help scientists report their analyses in a way that allows exact recreation by a third party of all computations that transform the input data into the results, including figures, tables and numbers. The project’s contributions comprise an emphasis on literate programming vignettes, the BiocStyle and ReportingTools packages, the assembly of experiment data and annotation packages, and the archiving and availability of all previously released packages. ... Full remote reproducibility remains a challenging problem, in particular for computations that require large computing resources or access data through infrastructure that is potentially transient or has restricted access (e.g., the cloud). Nevertheless, many examples of fully reproducible research reports have been produced with Bioconductor." [@huber2015orchestrating]
"Using Bioconductor requires a willingness to modify and eventually compose scripts in a high-level computer language, to make informed choices between different algorithms and software packages, and to learn enough R to do the unavoidable data wrangling and troubleshooting. Alternative and complementary tools exist; in particular, users may be ready to trade some loss of flexibility, automation or functionality for simpler interaction with the software, such as by running single-purpose tools or using a point-and-click interface. Workflow and data management systems such as Galaxy and Illumina BaseSpace provide a way to assemble and deploy easy-touse analysis pipelines from components from different languages and frameworks. The IPython notebook provides an attractive interactive workbook environment. Although its origins are with the Python programming language, it now supports many languages, including R. In practice, many users will find a combination of platforms most productive for them." [@huber2015orchestrating]
"A lab manual is perhaps the best way to inform new lab members of the ins and outs of the lab and to keep all members updated on protocols and regulations. What could be included in a lab manual? Anything you do not want to explain over and over, anything that will make the lab more functional and that can make life easier for yourself and lab members." [@leips2010helm]
"Most PIs wish the labs were more organized, but it is not a huge priority, that is, until the first student leaves and no one can find a particular cell line in the freezer boxes. Resolutions are made, the crisis passes, and all goes on as before until the next person leaves. Although it is probably inevitable that there will be some confusion when a long-time lab member moves on, an organized lab will not be as affected as an unorganized one." [@leips2010helm]
"LaTeX gives you output documents that look great and have consistent cross-references and citations. Much of your output document is created automatically and much is done behind the scenes. This gives you extra time to think about the ideas you want to present and how to communicate those ideas in an effective way. " [@van2012latex]
"LaTeX provides state-of-the-art typesetting" [@van2012latex]
"Many conferences and publishers accept LaTeX. In addition they provide classes and packages that guarantee documents conforming to the required formatting guidelines." [@van2012latex]
"LaTeX automatically numbers your chapters, sections, figures, and so on." [@van2012latex]
"LaTeX has excellent bibliography support. It supports consistent citations and an automatically generated bibliography with a consistent look and feel. The style of citations and the organisation of the bibliography is configurable." [@van2012latex]
"LaTeX is very stable, free, and available on many platforms." [@van2012latex]
"LaTeX was written by Leslie Lamport as an extension of Donald Knuth's TeX program. It consists of a Turing-complete procedural markup language and a typesetting processor. The combination of the two lets you control both the visual presentation as well as the content of your documents." [@van2012latex]
"Roughly speaking LaTeX is built on top of TeX. This adds extra functionality to TeX and makes writing your documents much easier." [@van2012latex]
"To create a perfect output file and have consistent cross-references and citations, latex also writes information to and reads information from auxiliary files. Auxiliary files contain information about page numbers of chapters, sections, tables, figures, and so on. Some auxiliary files are generated by latex itself (e.g., aux files). Others are generated by external programs such as bibtex, which is a program that generates information for the bibliography. When an auxiliary file changes then LaTeX may be out of sync. You should rerun latex when this happens." [@van2012latex]
"LaTeX is a markup language and document preparation system. It forces you to focus on the content and not on the presentation. In a LaTeX program you write the content of your document, you use commands to provide markup and automate tasks, and you import libraries." [@van2012latex]
"The main purpose of [LaTeX] commands is to provide markup. For example, to specify the author of the document you write
\author{<author name>}
. The real strength of LaTeX is that it also is a Turing-complete programming language, which lets you define your own commands. These commands let you do real programming and give you ultimate control over the content and the final visual presentation. You can reuse your commands by putting them in a library." [@van2012latex]"The paragraph is one of the most important basic building blocks of your document. The paragraph formation rules depend on how latex treats spaces, empty lines, and comments. Roughly, the rules are as follows. In its default model, latex treats a sequence of one or more spaces as a single space. The end of the line is the same as a space. However: An empty line acts as an end-of-paragraph specifier..." [@van2012latex]
Including executable code in other languages.
In your RMarkdown documents, you include executable code in special sections ("chunks") that are separated from the regular text using a special combination of characters, as described earlier in this module and in the previous module. By default, in Rmarkdown files the code in these chunks is executed using the R programming language. However, you can also include executable code in a number of other programming languages. For example, you could set some code chunks to run Python, others to run Julia, and still others to run shell scripts (e.g., with bash).
This can be very helpful if you have a workflow with steps that are easier to code in different languages. For example, there may be a module in Python that works well for an early step in your data preprocessing, and then later steps that are easier with general R functions. This presents no problem when creating an RMarkdown data pre-processing protocol, as you can write different steps in different languages.
The program that is used to run the code in a specific chunk is called the "engine" for that chunk [ref---R Markdown def guide]. You can change the engine by changing the combination of characters you use to demarcate the start of executable code. When you are including a chunk of R code, you mark it off starting with the character combination `` ```{r} ``. You change this to give the engine you would like to use---for example, you would include a chunk of Python code starting with `` ```{python} `` [ref---R Markdown def guide]. When your RMarkdown document is rendered, your computer will use the specified software to run each code chunk. Of course, to run that piece of code, your computer must have that type of software installed and available. For example, if you include a chunk of code that you'd like to run with a Python engine, you must have Python on your computer.
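For example, a single RMarkdown file might mix engines along these lines; the chunk contents here are just placeholders, and each engine (Python, bash) must be installed for the document to render:

````markdown
```{r}
# An R chunk: read in a (hypothetical) raw data file
dat <- read.csv("data/raw_measurements.csv")
summary(dat)
```

```{python}
# A Python chunk: the python engine runs this code instead of R
import statistics
print(statistics.mean([1, 2, 3, 4]))
```

```{bash}
# A bash chunk: list the files in the project's data directory
ls data/
```
````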
While you can use many different software programs as the engine for each code chunk, there are a few limitations with some programs. For some engines, most notably R, objects created when one chunk is run remain available as input to later chunks that use the same engine [ref---R Markdown def guide]. This is not the case, however, for most of the available engines, which run each chunk in its own separate session. For example, if you use the SAS software program as the engine for one of your code chunks, the output from running that code will not be available as input to later code chunks in the document.
Caching code results.
Some code can take a while to run, particularly if it is processing very large datasets. By default, RMarkdown re-runs all of the code in the document every time you render it. This is usually the best set-up, since it lets you confirm that all of the code executes as desired each time the document is rendered. However, if some steps take a long time, rendering the document can become slow.
To help with this problem, RMarkdown has a system that allows you to cache the results from some or all code chunks in the document. This is a really nice system: it will check the inputs to a cached chunk each time the document is rendered. If those inputs have changed, it will take the time to re-run that piece of code using the updated inputs. If the inputs have not changed since the last time the document was rendered, the previous results for that chunk are loaded from the cache and used without re-running the code. This saves time on most renders, while still re-running the code when necessary, that is, when the inputs have changed and the outputs may differ.
There are some downsides to caching. For example, caching can increase the storage space it takes to save Rmarkdown work, as intermediate results are saved. However, if some of your code is very time-intensive to run, it may make sense to look into caching options with Rmarkdown. For more on caching with Rmarkdown documents, see this section of the R Markdown Cookbook [ref].
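As a sketch, caching is turned on for an individual chunk with the `cache` chunk option; the chunk label and the slow function called here are hypothetical:

````markdown
```{r clean_big_data, cache = TRUE}
# Re-run only when this chunk's code (or tracked inputs) change;
# otherwise the stored results are reused
cleaned <- clean_raw_measurements(raw_data)
```
````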
Outputting to other formats.
You can use RMarkdown to create documents other than traditional reports. Scientists might find two other output types, presentations and posters, particularly useful.
RMarkdown has long supported pdf slide output. This output leverages the "beamer" format from LaTeX. You can create a series of presentation slides in RMarkdown, using Markdown to specify formatting, and the document will be rendered to pdf slides. These slides can be shown using pdf viewer software, like Adobe Acrobat, set either to full screen or to its presentation option. More recently, capability has been added to RMarkdown that allows you to create PowerPoint slides. Again, you start from an RMarkdown document, using Markdown syntax to do things like divide content into separate slides. Regardless of the output format you choose (pdf slides or PowerPoint), the code to generate the figures and tables in the presentation can be included directly in the RMarkdown file, so it is re-run with the latest data each time you render the presentation.
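As a sketch, the output format for slides is chosen in the document's YAML header; the title and author below are placeholders:

```yaml
---
title: "June experiments: preliminary results"
author: "Your Name"
output: beamer_presentation    # pdf slides; use powerpoint_presentation for .pptx
---
```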
It is also possible to use RMarkdown to create scientific posters, although this is a bit less common and there are fewer tutorial resources with instructions on doing this. To find out more about creating scientific posters with Rmarkdown, you can start by looking at the documentation for some R packages that have been created for this process. Two include the `posterdown` package [ref], with documentation available at https://reposhub.com/python/miscellaneous/brentthorne-posterdown.html, and the `pagedown` package [ref], with documentation available at https://github.com/rstudio/pagedown. There are also some blog posts available where researchers describe how they created a poster with Rmarkdown; one thorough one is "How to make a poster in R" by Wei Yang Tham, available at https://wytham.rbind.io/post/making-a-poster-in-r/.
Customization of RMarkdown documents has evolved in another useful direction through RMarkdown templates. These are templates that are customized, often very highly customized, while still allowing you to write the content in RMarkdown. One area where these templates can be very useful to scientists is with article templates customized for specific scientific journals. A number of journals have created LaTeX templates that can be used when writing drafts to submit to the journal. These templates produce a draft that is nicely formatted, following all of the journal's submission guidelines and, in some cases, formatted as the final article would appear in the journal. Such templates have existed for a long time, particularly for journals in fields where LaTeX is commonly used for document preparation, including physics and statistics. However, they traditionally required you to write in LaTeX, a powerful but complex markup language with a steep learning curve.
Now, many of these article templates have been wrapped within RMarkdown templates, allowing you to take advantage of them while writing all of the content in RMarkdown syntax and including executable code directly in the draft. An example of the first page of an article created in RMarkdown using one of these article templates is shown in Figure \@ref(fig:rticleexample).
These RMarkdown templates are typically available through R packages, which you can install on your computer in the same way you would install any R package (i.e., with the install.packages function). Many journal article templates are available through the rticles package [ref], including the template used to create the manuscript shown in Figure \@ref(fig:rticleexample). You can find more information about the rticles package on its GitHub page, at https://github.com/rstudio/rticles. There is also a section on writing manuscripts for scientific journals with RMarkdown in the book R Markdown: The Definitive Guide [ref], available online at https://bookdown.org/yihui/rmarkdown/rticles-templates.html.
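As a hedged sketch, once the package is installed you can typically start a journal-formatted draft either through RStudio's "New File > R Markdown > From Template" dialog or from the R console with rmarkdown::draft; the file name below is made up, and "plos" is just one example of the template names the package provides:

```r
# install.packages("rticles")   # once, to install the package

# Create a new manuscript draft based on the PLOS journal template;
# see the rticles documentation for the full list of template names
rmarkdown::draft("my-manuscript.Rmd", template = "plos", package = "rticles")
```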
A similar idea is parameterized RMarkdown documents. These are a simple way to create a kind of template for recurring reports in your laboratory. You create them in much the same way as regular RMarkdown documents, but they include a set of parameters whose values you can change each time you render the document. There is a section on parameterized reports in R Markdown: The Definitive Guide [ref].
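As a minimal sketch, a parameterized report declares its inputs in a params field of the YAML header, and their values are then available in code chunks through a list named params; the parameter names and file names here are hypothetical:

````markdown
---
title: "Weekly assay report"
output: html_document
params:
  assay_date: "2021-06-01"
  data_file: "assay_results.csv"
---

```{r}
# Parameter values set in the YAML header are available as `params$...`
dat <- read.csv(params$data_file)
```
````

You can then render the same document with different inputs, for example with rmarkdown::render("report.Rmd", params = list(data_file = "new_results.csv")) (again, the file and parameter names are just examples).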
You can also use RMarkdown to create much larger outputs than simple reports and protocols. RMarkdown can now be used to create large, dynamic documents, including online books (which can also be rendered to PDF versions suitable for printing), dashboard-style websites, and blogs. Once members of your research group are familiar with the basics of RMarkdown, you may want to explore using it to create these more complex outputs.

The book format, provided by the bookdown package, divides content into chapters and special sections like appendices and references. It includes a linked table of contents, so readers can easily navigate the content, and its underlying format allows readers to do things like change the font size and search the book text for keywords. The book containing these modules is one example of a book created with bookdown. If you would like to explore using bookdown to create online books from RMarkdown files, there are a number of resources available. An online book with extensive instructions on using the package is available at https://bookdown.org/yihui/bookdown/, and there is a helpful website with more details at https://bookdown.org/. That website includes a gallery of example books created with bookdown (https://bookdown.org/home/archive/), which you can use to explore the types of books that can be created.
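As a hedged sketch of how a bookdown book is typically rendered from the R console, assuming your project follows the usual bookdown structure (an index.Rmd file plus one .Rmd file per chapter):

```r
# install.packages("bookdown")   # once, to install the package

# Render all of the chapter .Rmd files into a single online book;
# "bookdown::gitbook" is one of several available book output formats
bookdown::render_book("index.Rmd", output_format = "bookdown::gitbook")
```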
You can also use RMarkdown to create websites that include a blog section, where you can write and regularly post new entries, keeping the site dynamic. This is a nice entry point to developing and maintaining a website for people who are learning to code in R but otherwise have not done much coding, since all of the steps can be done within RStudio. There are templates for these sites that are appropriate for personal or research group websites for academics. Such sites can be created to highlight the research and people in your lab, and you can encourage students and postdocs to create personal sites to raise the profile of their research. In the past, we have even used one as a central, unifying spot for a group study, with students contributing blog posts as their graded assignment (https://kind-neumann-789611.netlify.app/). To learn how to create websites with blogs, you can check the book blogdown: Creating Websites with R Markdown [ref], which is available both in print and free online at https://bookdown.org/yihui/blogdown/. This process takes a bit of work to get the website set up initially, but afterwards maintenance is easy and straightforward.
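As a minimal sketch, assuming the blogdown package is installed, a new site can be created and previewed from the R console; the theme and structure of the resulting site will depend on the choices you make:

```r
# install.packages("blogdown")   # once, to install the package

# Create the skeleton of a new Hugo-based website in the current project
blogdown::new_site()

# Preview the site locally while you edit and add posts
blogdown::serve_site()
```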
Finally, a simpler way to make basic web content with RMarkdown is through the flexdashboard format. This format creates a smaller website focused on sharing data results; you can see a gallery of examples at https://rmarkdown.rstudio.com/flexdashboard/examples.html. It is excellent for creating a webpage that allows users to view complex, and potentially interactive, results from data you have collected, and it can be particularly helpful for groups that need to communicate regularly updated data quickly. During the COVID-19 pandemic, for example, many public health departments maintained dashboard-style websites to share evolving data on COVID-19 in their communities. Using RMarkdown here has the key advantage of making it easy to update the dashboard webpage as you get new or updated data, since it is easy to re-run any data processing, analysis, and visualization code in the document. To learn how to use RMarkdown to create dashboard websites, you can check out RStudio's flexdashboard site at https://rmarkdown.rstudio.com/flexdashboard/index.html. There is also guidance in one of the chapters of R Markdown: The Definitive Guide [ref]: https://bookdown.org/yihui/rmarkdown/dashboards.html.
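As a hedged sketch, a flexdashboard is an RMarkdown document whose YAML header uses the flexdashboard::flex_dashboard output format, with headers used to lay out columns and panels; the title and plotting code here are placeholders:

````markdown
---
title: "Lab data dashboard"
output: flexdashboard::flex_dashboard
---

Column
-------------------------------------

### Latest measurements

```{r}
# Code that reads the most recent data and visualizes it would go here;
# plot(cars) is just a placeholder using a built-in dataset
plot(cars)
```
````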
More complex formatting.
As mentioned earlier, Markdown is a fairly simple markup language. Occasionally, this simplicity means you cannot create some of the fancier formatting you might want. RMarkdown provides a way to work around this constraint.
In RMarkdown documents, when you need more complex formatting, you can shift into a more powerful markup language for part of the document. Markup languages like LaTeX and HTML are much more expressive than Markdown, with many more formatting choices; for example, both can create far more complex tables than Markdown. There is a downside, however: when you include formatting written in these more powerful markup languages, you limit the output formats the document can be rendered to. If you include raw LaTeX in an RMarkdown document, you must render it to PDF, while if you include raw HTML, you must render it to an HTML file. Conversely, if you stick with the simpler formatting available through Markdown syntax, you can easily switch the document's output format among several choices.
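As a small illustration (a sketch rather than a recommended template), raw LaTeX and raw HTML can each be mixed into the Markdown text, with the caveat described above that each ties the document to a single output format:

```markdown
Some regular Markdown text, with **bold** formatting that works in any output.

\begin{center}
This centered text is raw LaTeX, so the document must be rendered to PDF.
\end{center}

<p style="color: steelblue;">
This paragraph is raw HTML with inline CSS, so the document must be rendered to HTML.
</p>
```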
The R Markdown Cookbook [ref] includes chapters on customizing RMarkdown output through LaTeX (https://bookdown.org/yihui/rmarkdown-cookbook/latex-output.html) and HTML (https://bookdown.org/yihui/rmarkdown-cookbook/html-output.html). These customizations can apply to the entire document (for example, you can change the appearance of a whole HTML document by customizing its CSS style file). They can also be smaller in scope, like changing the citation style used with a BibTeX file by adding to the LaTeX preamble.
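As a hedged sketch of the kinds of smaller customizations mentioned above, output-format options in the YAML header can point to a custom CSS file for HTML output or to extra LaTeX preamble commands for PDF output; the file names my_styles.css and preamble.tex are hypothetical:

```markdown
output:
  html_document:
    css: my_styles.css          # custom CSS rules for HTML output (hypothetical file)
  pdf_document:
    includes:
      in_header: preamble.tex   # extra LaTeX preamble commands (hypothetical file)
```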
One area of customization that is particularly useful and simple to implement is customized tables. Markdown syntax can create only very simple tables and does not allow for more complex ones. There is an R package called kableExtra [ref] that allows you to create very attractive and complex tables in RMarkdown documents.
This package leverages more of the power of the underlying markup languages, rather than the simpler Markdown language. Recall that Markdown is easy to learn because it has a fairly limited set of special characters and markings for specifying formatting in the output document. That basic set of functionality is often all you need, but complex table formatting requires more. Much more is available in the deeper markup languages used to render specific output formats: software derived from TeX for PDF documents, and HTML for web output. As a result, to take advantage of this package you will need to write RMarkdown files that are tied to a single output format (PDF or HTML).
You can install this package just like any other R package from CRAN, using install.packages. You will then need to call library("kableExtra") within your RMarkdown document before using functions from the package. The kableExtra package is extensively documented through two vignettes that come with the package, one for PDF output (https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_pdf.pdf) and one for HTML output (https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html). There is also information on using kableExtra in the R Markdown Cookbook [ref]: https://bookdown.org/yihui/rmarkdown-cookbook/kableextra.html.
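As a hedged sketch of typical use (the dataset and styling choices here are arbitrary examples, not recommendations), a table is first built with knitr::kable and then styled with kableExtra functions:

````markdown
```{r}
library(knitr)
library(kableExtra)

# Build a basic table from a built-in dataset, then layer on kableExtra styling
kable(head(mtcars[, 1:4]), caption = "An example table") %>%
  kable_styling(full_width = FALSE) %>%                          # compact table
  add_header_above(c(" " = 1, "Engine and performance" = 4))     # grouped header row
```
````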
"WordPerfect was always the best word processor. Because it allowed for insight into its very structure. You could hit a certain key combination and suddenly the screen would split and you'd reveal the codes, the bolds and italics, and so forth, that would define your text when it was printed. It was beloved of legal secretaries and journalists alike. Because when you work with words, at the practical, everyday level, the ability to look under the hood is essential. Words are not simple. And WordPerfect acknowledged that. Microsoft Word did not. Microsoft kept insisting that what you saw on your screen was the way things were, and if your fonts just kept sort of randonmly changing, well, you must have wanted it that way. Then along came HTML, and what I remember most was that sense of being back inside the file. Sure, HTML was a typographic nightmare, a bunch of unjustified Times New Roman in 12 pt on screens with chiclet-size pixels, but under the hood you could see all the pieces. Just like WordPerfect. That transparency was a wonderful thing, and it renewed computing for me." [@ford2015on]
"TeX was created by Donald E. Knuth, a professor at Stanford University who has achieved international renown as a mathematician and computer scientist. Knuth also has an aesthetic sense uncommon in his field, and his work output is truly phenomenal. TeX is a happy byproduct of Knuth's mammoth enterprise, The Art of Computer Programming. This series of reference books, designed to cover the whole gamut of programming concepts and techniques, is a sine qua non for all computer scientists." [@seroul2012beginner]
"Roughly speaking, text processors fall into two categories: (1) WYSIWYG systems: what you see is what you get. You see on the screen at all times what the printed document will look like, and what you type has immediate effect on the appearance of the document. (2) markup systems, where you type your text interspersed with formatting instructions, but don't see their effect right away. You must run a program to examine the resulting image, whether on paper or on the screen. In computer science jargon, markup systems must compile the source file you type. WYSIWYG systems have the obvious advantage of immediate feedback, but they are not very precise: what is acceptable at a resolution of 300 dots per inch, for an ephemeral publication such as a newsletter or flier, is no longer so for a book that will be phototypeset at high resolution. The human eye is extraordinarily sensitive: you can be bothered by the appearance of a text without being able to pinpoint why, just as you can tell when someone plays the wrong note in an orchestra, without being able to identify the CUlprit. One quickly leams in typesetting that the beauty, legibility and comfortable reading of a text depend on minute details: each element must be placed exactly right, within thousandths of an inch. For this type of work, the advantage of immediate feedback vanishes: fine details of spacing, alignment, and so on are much too small to be discernible at the screen's relatively low resolution, and even if it such were not the case, it would still be a monumental chore to find the right place for everything by hand. For this reason it is not surprising that in the world of professional typesetting markup systems are preferred. They automate the task of finding the right place for each character with great precision. Naturally, this approach is less attractive for beginners, since one can't see the results as one types, and must develop a feeling for what the system will do. But nowadays, you can have the best of both worlds by using a markup system with a WYSIWYG front end; we'll talk about such front ends for TEX later on. TEX was developed in the late seventies and early eighties, before WYSIWYG systems were widespread. But were it to be redesigned now, it would still be a markup language. To give you an idea of the precision with which TEX operates: the internal unit it uses for its calculations is about a hundred times smaller than the wavelength of visible light! (That's right, a hundred times.) In other words, any round-off error introduced in the calculations is invisible to the naked eye." [@seroul2012beginner]
"You should be sure to understand the difference between a text editor and a text processor. A text processor is a text editor together with formatting software that allows you to switch fonts, do double columns, indent, and so on. A text editor puts your text in a file on disk, and displays a portion of it on the screen. It doesn't format your text at all. We insist on the difference because those accustomed to WYSIWYG systems are often not aware of it: they only know text processors. Where can you find a text editor? Just about everywhere. Every text processor includes a text editor which you can use. But if you use your text processor as a text editor, be sure to save your file using a 'save ASCII' or 'save text only' option, so that the text processor's own formatting commands are stripped off. If you give TEX a file created without this precaution, you'll get garbage, because TEX cannot digest your text processor's commands." [@seroul2012beginner]
"TeX enabled authors to encode their precise intent into their manuscripts: This block of text is a computer program, while this word is a keyword in that program. The language it used, called TeX markup, formalized the slow, error-prone communication that is normally carried out with the printer over repeated galley proofs." [@apte2019lingua]
"The idea of writing markup inside text wasn’t especially novel; it has been used from 1970’s runoff (the UNIX family of printer-preparation utilities) to today’s HTML tags. TeX was new in that it captured key concepts necessary for realistic typesetting and formalized them." [@apte2019lingua]
"With these higher-level commands, the free TeX engine, and the LaTeX book, the use of TeX exploded. The macro file has since evolved and changed names, but authors still typically run the program called latex or its variants. Hence, most people who write TeX manuscripts know the program as LaTeX and the commands they use as LaTeX commands." [@apte2019lingua]
"The effect of LaTeX on scientific and technical publishing has been profound. Precise typesetting is critical, particularly for conveying concepts using chemical and mathematical formulas, algorithms, and similar constructs. The sheer volume of papers, journals, books, and other publications generated in the modern world is far beyond the throughput possible via manual typesetting. And TeX enables automation without losing precision. Thanks to LaTeX, book authors can generate camera-ready copy on their own. Most academic and journal publishers accept article manuscripts written in LaTeX, and there’s even an open archive maintained by Cornell University where authors of papers in physics, chemistry, and other disciplines can directly submit their LaTeX manuscripts for open viewing. Over 10,000 manuscripts are submitted to this archive every month from all over the world." [@apte2019lingua]
"For many users, a practical difficulty with typesetting using TeX is preparing the manuscripts. When TeX was first developed, technical authors were accustomed to using plain-text editors like WordStar, vi, or Emacs with a computer keyboard. The idea of marking up their text with commands and running the manuscript through a typesetting engine felt natural to them. Today’s typesetters, particularly desktop publishers, have a different mental model. They expect to see the output in graphical form and then to visually make edits with a mouse and keyboard, as they would in any WYSIWYG program. They might not be too picky about the quality of the output, but they appreciate design capabilities, such as the ability to flow text around curved outlines. Many print products are now produced with tools like Microsoft Word for this very reason. TeX authors cannot do the same work as easily." [@apte2019lingua]
"Poor documentation can lead to irreproducibility and serious errors. There's a vast amount of lurking complexity in bioinformatics work: complex workflows, multiple files, countless program parameters, and different software versions. The best way to prevent this complexity from causing problems is to document everything extensively. Documentation also makes your life easier when you need to go back and rerun an analysis, write detailed methods about your steps for a paper, or find the origin of some data in a directory." [@buffalo2015bioinformatics]
"Scatterplots are useful for visualizing treatment-response comparisons ..., assocations between variables ..., or paired data (e.g., a disease biomarker in several patients before and after treatment)." [@holmes2018modern]
"Sometimes we want to show the relationshiips between more than two variables. Obvious choices for including additional dimensions are plot symbol shapes and colors. ... Another way to show additional dimensions of the data is to show multiple plots that result from repeatedly subsetting (or 'slicing') the data based on one (or more) of the variables, so that we can visualize each part separately. This is called faceting and it enables us to visualize data in up to four or five dimensions. So we can, for instance, investigate whether the observed patterns among the other variables are the same or different across the range of the faceting variable." [@holmes2018modern]
"You can add an enormous amount of information and expressivity by making your plots interactive. ... The package
ggvis
is an attempt to extend the good features ofggplot2
into the realm of interactive graphics. In contrast toggplot2
, which produces graphics into R's traditional graphics devices (PDF, PNG, etc.),ggvis
builds upon a JavaScript infrastructure called Vega, and its plots are intended to be viewed in an HTML browser." [@holmes2018modern]"Heatmaps are a powerful way of visualizing large, matrix-like datasets and providing a quick overview of the patterns that might be in the data. There are a number of heatmap drawing functions in R; one that is convenient and produces good-looking output is the function
pheatmap
from the eponymous package." [@holmes2018modern]"Plots in which most points are huddled in one area, with much of the available spaces sparesly populated, are difficult to read. If the histogram of the marginal distribution of a variable has a sharp peak and then long tails to one or both sides, transforming the data can be helpful. ... The plots in this chapter that involve microarray data use the logarithmic transformation [footnote: 'We used it implicitly, since the data in the
ExpressionSet
objectx
already came log-transformed']---not only in scatterplots... for the x- and y-coordinates but also ... for the color scale that represents the expression fold change. The logarithm transformation is attractive because it has a definite meaning---a move up or down by the same amount on a log-transformed scale corresponds to the same multiplicative change on the original scale: log(ax) = log a + log x. Sometimes, however, the logarithm is not good enough, for instance when the data include zero or negative values, or when even on the logarithmic scale the data distribution is highly uneven." [@holmes2018modern]"To visualize genomic data, in addition to the general principles we have discussed in this chapter, there are some specific considerations. The data are usually associated with genomic coordinates. In fact, genomic coordinates offer a great organizing principle for the integration of genomic data. ... The main challenge of genomic data visualization is the size of the genomes. We need visualization at multiple scales, from whole genome down to the nucleotide level. It should be easy to zoom in and out, and we may need different visualization strategies for the different size scales. It can be convenient to visualize biological molecules (genomes, genes, transcripts, proteins) in a linear manner, although their embedding in the physical world can matter (a great deal)." [@holmes2018modern]
"Visualizing the data, either 'raw' or along the various steps of processing, summarization, and inference, is one of the most important activities in applied statistics and, indeed, in science. It sometimes gets short shrift in textbooks since there is not much deductive theory. However, there are many good (and bad) practices, and once you pay attention to it, you will quickly see whether a certain graphic is effective in conveying its message, or what choices you could make to create powerful and aesthetically attractive data visualizations." [@holmes2018modern]
"Most scholarly works have citations and a bibliography or reference section. ... The purpose of the bibliography is to provide details of the works that are cited in the text. We shall refer to cited works as references. ... The bibliography entries are listed as
. The of a reference is also used when the work is cited in the text. The lists the relevant information about the work. ... Even within a single work there may be different styles of citations. Parenthetical citations are usually formed by putting one or several citation labels inside square brackets or parentheses. However, there are also other forms of citations that are derived from information in the citation label. ... The \bibliographystyle
command tells LaTeX which style to use for the bibliography [e.g., labels as numbers, labels as names and years]. The bibliography style called