rm(list=ls())

Efficient workflow {#workflow}

Efficient programming is an important skill for generating the correct result, on time. Yet coding is only one part of a wider skillset needed for successful outcomes for projects involving R programming. Unless your project is to write generic R code (i.e. unless you are on the R Core Team), the project will probably transcend the confines of the R world: it must engage with a whole range of other factors. In this context we define 'workflow' as the sum of practices, habits and systems that enable productivity.^[The Oxford Dictionary's definition of workflow is similar, with a more industrial feel: "The sequence of industrial, administrative, or other processes through which a piece of work passes from initiation to completion." ] To some extent workflow is about personal preferences. Everyone's mind works differently so the most appropriate workflow varies from person to person and from one project to the next. Project management practices will also vary depending on the scale and type of the project: it's a big topic but can usefully be condensed in 5 top tips.

Prerequisites {-}

This chapter focuses on workflow. For project planning and management, we'll use the DiagrammeR package. For project reporting we'll focus on R Markdown and knitr which are bundled with RStudio (but can be installed independently if needed). We'll suggest other packages that are worth investigating, but are not required for this particular chapter.

library("DiagrammeR")

Top 5 tips for efficient workflow

  1. Start without writing code but with a clear mind and perhaps a pen and paper. This will ensure you keep your objectives at the forefront of your mind, without getting lost in the technology.

  2. Make a plan. The size and nature will depend on the project but time-lines, resources and 'chunking' the work will make you more effective when you start.

  3. Select the packages you will use for implementing the plan early. Minutes spent researching and selecting from the available options could save hours in the future.

  4. Document your work at every stage: work can only be effective if it's communicated clearly and code can only be efficiently understood if it's commented.

  5. Make your entire workflow as reproducible as possible. knitr can help with this in the phase of documentation.

A project planning typology

Appropriate project management structures and workflow depend on the type of project you are undertaking. The typology below demonstrates the links between project type and project management requirements.^[Thanks to Richard Cotton for suggesting this typology.]

Based on these observations we recommend thinking about which type of workflow, file structure and project management system suits your projects best. Sometimes it's best not to be prescriptive so we recommend trying different working practices to discover which works best, time permitting.^[The importance of workflow has not gone unnoticed by the R community and there are a number of different suggestions to boost R productivity. Rob Hyndman, for example, advocates the strategy of using four self-contained scripts to break up R work into manageable chunks: load.R, clean.R, func.R and do.R. ]

There are, however, concrete steps that can be taken to improve workflow in most projects that involve R programming. Learning them will, in the long-run, improve productivity and reproducibility. With these motivations in mind, the purpose of this chapter is simple: to highlight some key ingredients of an efficient R workflow. It builds on the concept of an R/RStudio project, introduced in Chapter \@ref(set-up), and is ordered chronologically throughout the stages involved in a typical project's lifespan, from its inception to publication:

Project planning and management

Good programmers working on a complex project will rarely just start typing code. Instead, they will plan the steps needed to complete the task as efficiently as possible: "smart preparation minimizes work" [@berkun2005art]. Although search engines are useful for identifying the appropriate strategy, trial-and-error approaches (for example typing code at random and Googling the inevitable error messages) are usually highly inefficient.

Strategic thinking is especially important during a project's inception: if you make a bad decision early on, it will have cascading negative impacts throughout the project's entire lifespan. So detrimental and ubiquitous is this phenomenon in software development that a term has been coined to describe it: technical debt. This has been defined as "not quite right code which we postpone making it right" [@kruchten2012technical]. Dozens of academic papers have been written on the subject but, from the perspective of beginning a project (i.e. in the planning stage, where we are now), all you need to know is that it is absolutely vital to make sensible decisions at the outset. If you do not, your project may be doomed to failure of incessant rounds of refactoring.

To minimise technical debt at the outset, the best place to start may be with a pen and paper and an open mind. Sketching out your ideas and deciding precisely what you want to do, free from the constraints of a particular piece of technology, can be a rewarding exercise before you begin. Project planning and 'visioning' can be a creative process not always well-suited to the linear logic of computing, despite recent advances in project management software, some of which are outlined in the bullet points below.

Scale can loosely be defined as the number of people working on a project. It should be considered at the outset because the importance of project management increases exponentially with the number of people involved. Project management may be trivial for a small project but if you expect it to grow, implementing a structured workflow early could avoid problems later. On small projects consisting of a 'one off' script, project management may be a distracting waste of time. Large projects involving dozens of people, on the other hand, require much effort dedicated to project management: regular meetings, division of labour and a scalable project management system to track progress, issues and priorities will inevitably consume a large proportion of the project's time. Fortunately a multitude of dedicated project management systems have been developed to cater for projects across a range of scales. These include, in rough ascending order of scale and complexity:

Regardless of the software (or lack thereof) used for project management, it involves considering the project's aims in the context of available resources (e.g. computational and programmer resources), project scope, time-scales and suitable software. And these things should be considered together. To take one example, is it worth the investment of time needed to learn a particular R package which is not essential to completing the project but which will make the code run faster? Does it make more sense to hire another programmer or invest in more computational resources to complete an urgent deadline?

Minutes spent thinking through such issues before writing a single line can save hours in the future. This is emphasised in books such as @berkun2005art and @PMBoK2000 and useful online resources such those by teamgantt.com and lasa.org.uk, which focus exclusively on project planning. This section condenses some of the most important lessons from this literature in the context of typical R projects (i.e. which involve data analysis, modelling and visualisation).

'Chunking' your work

Once a project overview has been devised and stored, in mind (for small projects, if you trust that as storage medium!) or written, a plan with a time-line can be drawn-up. The up-to-date visualisation of this plan can be a powerful reminder to yourself and collaborators of progress on the project so far. More importantly the timeline provides an overview of what needs to be done next. Setting start dates and deadlines for each task will help prioritise the work and ensure you are on track. Breaking a large project into smaller chunks is highly recommended, making huge, complex tasks more achievable and modular [@PMBoK2000]. 'Chunking' the work will also make collaboration easier, as we shall see in Chapter 5.

local(source("code/04-project-planning_f1.R", local=TRUE))

The tasks that a project should be split into will depend on the nature of the work and the phases illustrated in Figure \@ref(fig:4-1) represent a rough starting point, not a template and the 'programming' phase will usually need to be split into at least 'data tidying', 'processing', and 'visualisation'.

Making your workflow SMART {#smart}

A more rigorous (but potentially onerous) way to project plan is to divide the work into a series of objectives and track their progress throughout the project's duration. One way to check if an objective is appropriate for action and review is by using the SMART criteria.

If the answer to each of these questions is 'yes', the task is likely to be suitable to include in the project's plan. Note that this does not mean all project plans need to be uniform. A project plan can take many forms, including a short document, a Gantt chart(see Figure \@ref(fig:4-2)) or simply a clear vision of the project's steps in mind.

## Code in "code/04-project-planning_f2.R"
knitr::include_graphics("figures/f4_2_DiagrammeR-gantt-book.png")

Visualising plans with R

Various R packages can help visualise the project plan. While these are useful, they cannot compete with the dedicated project management software outlined at the outset of this section. However, if you are working on relatively simple project, it is useful to know that R can help represent and keep track of your work. Packages for plotting project progress include:^[For a more comprehensive discussion of Gantt charts in R, please refer to stackoverflow.com/questions/3550341.]

The small example below (which provides the basis for creating charts like Figure \@ref(fig:4-2)) illustrates how DiagrammeR can take simple text inputs to create informative up-to-date Gantt charts. Such charts can greatly help with the planning and task management of long and complex R projects, as long as they do not take away valuable programming time from core project objectives.

library("DiagrammeR")
# Define the Gantt chart and plot the result (not shown)
mermaid("gantt
        Section Initiation
        Planning           :a1, 2016-01-01, 10d
        Data processing    :after a1  , 30d")

In the above code gantt defines the subsequent data layout. Section refers to the project's section (useful for large projects, with milestones) and each new line refers to a discrete task. Planning, for example, has the code a, begins on the first day of 2016 and lasts for 10 days. See knsv.github.io/mermaid/gantt.html for more detailed documentation.

Exercises {-}

  1. What are the three most important work 'chunks' of your current R project?

  2. What is the meaning of 'SMART' objectives (see Making your workflow SMART).

  3. Run the code chunk at the end of this section to see the output.

  4. Bonus exercise: modify this code to create a basic Gantt chart for an R project you are working on.

Package selection

A good example of the importance of prior planning to minimise effort and reduce technical debt is package selection. An inefficient, poorly supported or simply outdated package can waste hours. When a more appropriate alternative is available this waste can be prevented by prior planning. There are many poor packages on CRAN and much duplication so it's easy to go wrong. Just because a certain package can solve a particular problem, doesn't mean that it should.

Used well, however, packages can greatly improve productivity: not reinventing the wheel is part of the ethos of open source software. If someone has already solved a particular technical problem, you don't have to re-write their code, allowing you to focus on solving the applied problem. Furthermore, because R packages are generally (but not always) written by competent programmers and subject to user feedback, they may work faster and more effectively than the hastily prepared code you may have written. All R code is open source and potentially subject to peer review. A prerequisite of publishing an R package is that developer contact details must be provided, and many packages provide a site for issue tracking. Furthermore, R packages can increase programmer productivity by dramatically reducing the amount of code they need to write because all the code is packaged in functions behind the scenes.

Let's take an example. Imagine for a project you would like to find the distance between sets of points (origins, o and destinations, d) on the Earth's surface. Background reading shows that a good approximation of 'great circle' distance, which accounts for the curvature of the Earth, can be made by using the Haversine formula, which you duly implement, involving much trial and error:

# Generate lat/lon pairs
o = ggmap::geocode("Leeds")
d = ggmap::geocode("Newcastle")
dput(o)
dput(d)
# Function to convert degrees to radians
deg2rad = function(deg) deg * pi / 180

# Create origins and destinations
o = c(lon = -1.55, lat = 53.80)
d = c(lon = -1.61, lat = 54.98)

# Convert to radians
o_rad = deg2rad(o)
d_rad = deg2rad(d)

# Find difference in degrees
delta_lon = (o_rad[1] - d_rad[1])
delta_lat = (o_rad[2] - d_rad[2])

# Calculate distance with Haversine formula
a = sin(delta_lat / 2)^2 + cos(o_rad[2]) * cos(d_rad[2]) * sin(delta_lon / 2)^2
c = 2 * asin(min(1, sqrt(a)))
(d_hav1 = 6371 * c) # multiply by Earth's diameter

This method works but it takes time to write, test and debug. Much better to package it up into a function. Or even better, use a function that someone else has written and put in a package:

# Find great circle distance with geosphere
(d_hav2 = geosphere::distHaversine(o, d))

The difference between the hard-coded method and the package method is striking. One is 7 lines of tricky R code involving many subsetting stages and small, similar functions (e.g. sin and asin) which are easy to confuse. The other is one line of simple code. The package method using geosphere took perhaps 100^th^ of the time and gave a more accurate result (because it uses a more accurate estimate of the diameter of the Earth). This means that a couple of minutes searching for a package to estimate great circle distances would have been time well spent at the outset of this project. But how do you search for packages?

Searching for R packages

Building on the example above, how can one find out if there is a package to solve your particular problem? The first stage is to guess: if it is a common problem, someone has probably tried to solve it. The second stage is to search. A simple Google query, haversine formula R, returned a link to the geosphere package in the second result (a hardcoded implementation was first).

Beyond Google, there are also several sites for searching for packages and functions. rdocumentation.org provides a multi-field search environment to pinpoint the function or package you need. Amazingly, the search for haversine in the Description field yielded 10 results from 8 packages: R has at least 8 implementations of the Haversine formula! This shows the importance of careful package selection as there are often many packages that do the same job, as we see in the next section. There is also a way to find the function from within R, with RSiteSearch(), which opens a url in your browser linking to a number of functions (40) and vignettes (2) that mention the text string:

# Search CRAN for mentions of haversine
RSiteSearch("haversine")

How to select a package

Due to the conservative nature of base R development, which rightly prioritises stability over innovation, much of the innovation and performance gains in the 'R ecosystem' has occurred in recent years in the packages. The increased ease of package development [@Wickham_2015] and interfacing with other languages [e.g. @Eddelbuettel_2011] has accelerated their number, quality and efficiency. An additional factor has been the growth in collaboration and peer review in package development, driven by code-sharing websites such as GitHub and online communities such as ROpenSci for peer reviewing code.

Performance, stability and ease of use should be high on the priority list when choosing which package to use. Another more subtle factor is that some packages work better together than others. The 'R package ecosystem' is composed of interrelated packages. Knowing something of these inter-dependencies can help select a 'package suite' when the project demands a number of diverse yet interrelated programming tasks. The 'tidyverse', for example, contains many interrelated packages that work well together, such as readr, tidyr, and dplyr.^[An excellent overview of the 'tidyverse', formerly now as the 'hadleyverse' and its benefits is available from barryrowlingson.github.io/hadleyverse.] These may be used together to read-in, tidy and then process the data, as outlined in the subsequent sections.

There is no 'hard and fast' rule about which package you should use and new packages are emerging all the time. The ultimate test will be empirical evidence: does it get the job done on your data? However, the following criteria should provide a good indication of whether a package is worth an investment of your precious time, or even installing on your computer:

An article in simplystats discusses this issue with reference to the proliferation of GitHub packages (those that are not available on CRAN). In this context well-regarded and experienced package creators and 'indirect data' such as amount of GitHub activity are also highlighted as reasons to trust a package.

The websites of MRAN and METACRAN can help the package selection process by providing further information on each package uploaded to CRAN. METACRAN, for example, provides metadata about R packages via a simple API and the provision of 'badges' to show how many downloads a particular package has per month. Returning to the Haversine example above, we could find out how many times two packages that implement the formula are downloaded each month with the following urls:

# download.file("http://cranlogs.r-pkg.org/badges/last-month/geosphere",
#               "figures/geosphere-badge.svg")
# bm = rsvg::rsvg("figures/geosphere-badge.svg", width = 400)
# png::writePNG(bm, "figures/geosphere-badge.png")
knitr::include_graphics("figures/f4_3_geosphere-badge.png")
# download.file("http://cranlogs.r-pkg.org/badges/last-month/geoPlot",
#               "figures/geoPlot-badge.svg")
# bm = rsvg::rsvg("figures/geoPlot-badge.svg", width = 400)
# png::writePNG(bm, "figures/geoPlot-badge.png")
knitr::include_graphics("figures/f4_4_geoPlot-badge.png")

It is clear from the results reported above that geosphere is by far the more popular package, so is a sensible and mature choice for dealing with distances on the Earth's surface.

Publication

The final stage in a typical project workflow is publication. Although it's the final stage to be worked on, that does not mean you should only document after the other stages are complete: making documentation integral to your overall workflow will make this stage much easier and more efficient.

Whether the final output is a report containing graphics produced by R, an online platform for exploring results or well-documented code that colleagues can use to improve their workflow, starting it early is a good plan. In every case the programming principles of reproducibility, modularity and DRY (don't repeat yourself) will make your publications faster to write, easier to maintain and more useful to others.

Instead of attempting a comprehensive treatment of the topic we will touch briefly on a couple of ways of documenting your work in R: dynamic reports and R packages. A wealth of online resources exists on each of these; to avoid duplication of effort the focus is on documentation from a workflow efficiency perspective.

Dynamic documents with R Markdown

When writing a report using R outputs a typical workflow has historically been to 1) do the analysis 2) save the resulting graphics and record the main results outside the R project and 3) open a program unrelated to R such as LibreOffice to import and communicate the results in prose. This is inefficient: it makes updating and maintaining the outputs difficult (when the data changes, steps 1 to 3 will have to be done again) and there is an overhead involved in jumping between incompatible computing environments.

To overcome this inefficiency in the documentation of R outputs the R Markdown framework was developed. Used in conjunction with the knitr package, we have the ability to

R Markdown documents are plain text and have file extension .Rmd. This framework allows for documents to be generated automatically. Furthermore, nothing is efficient unless you can quickly redo it. Documenting your code inside dynamic documents in this way ensures that analysis can be quickly re-run.

```{block, rmd, type='rmdnote'} This note briefly explains R Markdown for the un-initiated. R markdown is a form of Markdown. Markdown is a pure text document format that has become a standard for documentation for software. It is the default format for displaying text on GitHub. R Markdown allows the user to embed R code in a Markdown document. This is a powerful addition to Markdown, as it allows custom images, tables and even interactive visualisations, to be included in your R documents. R Markdown is an efficient file format to write in because it is light-weight, human and computer readable, and is much less verbose than HTML and LaTeX. This book was written in R Markdown.

In an R Markdown document, results are generated *on the fly* by including 'code chunks'. Code chunks are R code that are preceded by ` ```r ` on the line before the R code, and ` ``` ` at the end of the chunk. For example, suppose we have  the code chunk

    ```r`r ''`
    (1:5)^2
    ```

in an R Markdown document. The `eval = TRUE` in the code indicates that the code should be evaluated while `echo = TRUE` controls whether the R code is displayed. When we compile the document, we get

```r
(1:5)^2

R Markdown via knitr provides a wide range of options to customise what is displayed and evaluated. When you adapt to this workflow it is highly efficient, especially as RStudio provides a number of shortcuts that make it easy to create and modify code chunks. To create a chunk while editing a .Rmd file, for example, simply enter Ctrl/Cmd+Alt+I on Windows or Linux or select the option from the Code drop down menu.

Once your document has compiled it should appear on your screen in the file format requested. If a html file has been generated (as is the default), RStudio provides a feature that allows you to put it up online rapidly. This is done using the rpubs website, a store of a huge number of dynamic documents (which could be a good source of inspiration for your publications). Assuming you have an RStudio account, clicking the 'Publish' button at the top of the html output window will instantly publish your work online, with a minimum of effort, enabling fast and efficient communication with many collaborators and the public.

An important advantage of dynamically documenting work this way is that when the data or analysis code changes, the results will be updated in the document automatically. This can save hours of fiddly copying and pasting of R output between different programs. Also, if your client wants pages and pages of documented output, knitr can provide them with a minimum of typing, e.g. by creating slightly different versions of the same plot over and over again. From a delivery of content perspective, that is certainly an efficiency gain compared with hours of copying and pasting figures!

If your R Markdown documents include time-consuming processing stages, a speed boost can be attained after the first build by setting opts_chunk$set(cache = TRUE) in the first chunk of the document. This setting was used to reduce the build times of this book, as can be seen on GitHub.

Furthermore dynamic documents written in R Markdown can compile into a range of output formats including html, pdf and Microsoft's docx. There is a wealth of information on the details of dynamic report writing that is not worth replicating here. Key references are RStudio's excellent website on R Markdown hosted at rmarkdown.rstudio.com and, for a more detailed account of dynamic documents with R, [@xie2015dynamic].

R packages

A strict approach to project management and workflow is treating your projects as R packages. This approach has advantages and limitations. The major risk with treating a project as a package is that the package is quite a strict way of organising work. Packages are suited for code intensive projects where code documentation is important. An intermediate approach is to use a 'dummy package' that includes a DESCRIPTION file in the root directory telling users of the project which packages must be installed for the code to work. This book is based on a dummy package so that we can easily keep the dependencies up-to-date (see the book's DESCRIPTION file online for an insight into how this works).

Creating packages is good practice in terms of learning to correctly document your code, store example data, and even (via vignettes) ensure reproducibility. But it can take a lot of extra time so should not be taken lightly. This approach to R workflow is appropriate for managing complex projects which repeatedly use the same routines which can be converted into functions. Creating project packages can provide a foundation for generalising your code for use by others, e.g. via publication on GitHub or CRAN. And R package development has been made much easier in recent years by the development of the devtools package, which is highly recommended for anyone attempting to write an R package.

The number of essential elements of R packages differentiate them from other R projects. Three of these are outlined below from an efficiency perspective.

The package testthat makes it easier than ever to test your R code as you go, ensuring that nothing breaks. This, combined with 'continuous integration' services, such as that provided by Travis, make updating your code base as efficient and robust as possible. This, and more, is described in [@cotton_testing_2016].

As with dynamic documents, package development is a large topic. For small 'one-off' projects the time taken in learning how to set-up a package may not be worth the savings. However packages provide a rigorous way of storing code, data and documentation that can greatly boost productivity in the long-run. For more on R packages see [@Wickham_2015]: the online version provides all you need to know about writing R packages for free (see r-pkgs.had.co.nz/).



akrmenec/Effic_R_bk documentation built on May 28, 2019, 4:53 p.m.