Introduction

The Jupyter notebook is an interactive environment for code evaluation, exploratory data analysis and reporting. In a notebook file it is possible to combine Markdown, program code as well as its output in corresponding input and output cells.

The primary motivation for using Jupyter notebooks is the ease of producing technical reports, software documentation, tutorials, etc, as the code fragments can be included directly in the document and the output is produced simultaneously. This feature may draw comparisons to Markdown, but the main advantage of Jupyter notebooks is that notebook files do not require compilation: in order to view the output, one doesn't have to "knit" the whole document every time.

Normally Jupyter notebook files are not distributed per se. Instead, notebook files are converted to static formats, most commonly HTML. An often overlooked potential of Jupyter notebooks rests in the format's dynamic, interactive nature. The prospect of distributing "live" Jupyter notebooks opens the possibility of truly interactive document types suitable for tutorials and demonstrations.

Historically the Notebook application was introduced in IPython in 2011. The Notebook app has evolved to support most of the commonly used programming languages and was renamed to Jupyter in 2014. In this vignette, the terms Jupyter and IPython notebook will be used interchangeably. However, note that the user commands and interface descriptions refer to IPython Notebook 3.2.1

Distribution of Jupyter Notebooks

Docker Images and Docker Hub

The IPython pipeline is notoriously difficult to install on certain operating systems. To facilitate the distribution of Jupyter notebooks, the use of Docker virtualization environment is suggested. Docker provides a system of images and containers for running applications in a Linux-like environment without the need of full virtualization.

The base Docker image for IPython toolset can be found in the registry of Docker Hub under the repository name: vladkim/ipynb. Essentially, this base image contains all the dependencies required for running IPython notebook together with the R kernel.

The strategy for distribution of Jupyter notebooks with Docker can be described as follows:

As described in subsection 2.3 all of these 4 steps are carried out by a simple function produceDockerfile, provided that you have your package vignette (or workflow) in Jupyter notebook format.

Example: Bioconductor RNA-seq workflow was converted to IPython notebook format and subsequently dockerized. On Docker Hub the image of the RNA-seq workflow is listed as vladkim/rnaseq.

Note that in the base image IPython Notebook application is used and not Jupyter. There is a subtle difference between the two since the announcement of the "Big Split" in April, 2015. The Jupyter notebook has diverged little from its original, but enough to cause certain issues, which is why the arguably more stable IPython Notebook is used for the time being.

Aesthetics and Usability of Notebooks

Even though inline CSS and JavaScript declarations are ignored in IPython Notebook 3.* series, it is possible to change the default appearance of the notebook application using custom.css and custom.js files in ~/ipython/profile_default/static/custom/ directory. Using this entry point for styling, the default notebook theme and typography settings were overwritten in the base Docker image as well as in its descendant images (e.g. vladkim/rnaseq)

screen

In addition to the custom CSS theme, the IPython notebook in the base Docker image loads a JavaScript extension "Breakpoints" to expand code execution options. The JavaScript notebook extension "Breakpoints" adds the following tool bar: breakpoints

Since there is no implicit caching of environmental variables, function and object definitions, upon restart of the notebook each cell has to be run up to the point where the last session was interrupted. By default in IPython notebook only one code cell can be run at a time. Using the "Breakpoints" toolbar one can run "from top to current cell" (second from left) or set a breakpoint, i.e. a bookmark, and run from the first cell until the next bookmark is encountered.

Creating Your Own Docker Image

The basis of a Docker image is a Dockerfile, a script with installation and set-up instructions passed to Docker. Suppose you want to convert your R package documentation into Jupyter notebook format and build a Docker image for running your vignette.

The conversion of static formats to .ipynb is covered in the next section. Assuming you already produced an .ipynb file, in order to dockerize your R package documentation, simply deposit the DESCRIPTION file of your package and the .ipynb vignette in the same folder, e.g. /home/example and pass the folder and the vignette name to produceDockerfile function to generate a Dockerfile:

library(RdocsJupyter)
produceDockerfile(dirc = "/home/example", vignette = "vignette.ipynb")

This will generate a Dockerfile with the following content:

FROM vladkim/ipynb:latest
COPY packages.R ./
## install R package dependencies
RUN R < packages.R --no-save && rm packages.R
## copy your vignette to the image working directory
COPY vignette.ipynb ./

The produceDockerfile function also extracts dependencies from the DESCRIPTION file and generates the packages.R script to install these dependencies on top of the base image vladkim/ipynb.

If you don't have a DESCRIPTION file, you can simply list all your dependencies in a text file in the Depends: field. Make sure you name the file DESCRIPTION, as this is the file that produceDockerfile is searching for.

In order to build your Docker image, simply run in the directory containing the Dockerfile

$ docker build -t username/example .

Conversion of Bioconductor Vignettes

Most of the existing R documentation is written using R markdown, knitr or Sweave. An important usability criterion for package authors would be the ease of conversion of static formats into Jupyter notebook format. The prerequisites for such document conversions would be pandoc and notedown. The most straightforward type of conversion is from Markdown to Jupyter notebook, which can be achieved with the notedown utility available on GitHub. The LaTeX-like vignettes written with knitr or Sweave need to be converted to Markdown with pandoc before the final conversion to .ipynb takes place.

R Markdown to Jupyter

To convert documentation in R Markdown into Jupyter notebook format, use the CRAN package rmarkdown to render the .Rmd file into Markdown Github:

library(rmarkdown)
render("vignette.Rmd", md_document(variant = "markdown_github"))

If your documentation contains typeset math in LaTeX notation, run instead

render("vignette.Rmd", md_document(variant = "markdown_github+tex_math_dollars"))

The pandoc extension +tex_math_dollars ensures that the LaTeX math is typeset correctly in markdown. For a comprehensive list of pandoc-TeX extensions please refer to the following link. These extensions can be added to or removed from the target file format explicitly with the '+' and '-' signs.

Next use notedown to convert the produced Markdown file into Jupyter notebook format. With notedown simply run in bash

$ notedown -o output.ipynb --nomagic input.md

The --nomagic flag suppresses %%R prefix tags in code blocks necessary for running R code with the native Python kernel. After converting the Rmd vignette into Jupyter notebook, the last step would be to switch to the R kernel in the notebook interface. Thus run

$ ipython notebook output.ipynb

This will open the notebook in your browser. Click on Kernel tab -> Change kernel -> R and save the notebook.

The current version of notedown supports the direct conversion of R markdown to Jupyter notebook format. However, it is recommended to render R Markdown into markdown_github before applying notedown. This becomes important (almost inevitable), if you include your bibliography as a BibTeX file.

Conversion of Sweave vignettes

The conversion of Rnw vignettes is somewhat more complicated, since Sweave vignettes contain TeX syntax. Produce a TeX file using Sweave() function. Run in R

 options(prompt=" ", continue=" ")
 Sweave("vignette.Rnw", eval=FALSE)

The options(...) command overwrites the default options(prompt="> ", continue="+ ") for prepending ">" in the beginning of each line and "+" for line continuation. These signs have to be removed for the code in notebook cells to be runnable.

The next step would be to preprocess the TeX file with processTex function in R:

 library(RdocsJupyter)
 processTex("vignette.tex")
 ```
 This will replace all Biconductor.sty macros, e.g.~`\R{}` or `\Robj{}`, remove all floats in the TeX file (such as `{figure}` environment) and replace some other environments known to be unrecognized by pandoc.

Convert the postprocessed TeX file into a markdown file with [**pandoc**](http://pandoc.org/). Run in bash:

$ pandoc input.tex -o output.md

The markdown file can be easily converted to Jupyter format with notedown:

$ notedown -o output.ipynb --nomagic input.md

## knitr Vignettes
To suppress output and code highlighting in knitr-generated documentation, the following options have to be set in R before the vignette is knit:
```r
knitr::opts_chunk$set(results = "hide", highlight = FALSE)

After the vignette was knit to produce a .tex file, you can run processTex function just like for Sweave vignettes:

 library(RdocsJupyter)
 processTex("vignette.tex")

Convert the postprocessed TeX file into a markdown file with pandoc:

$ pandoc input.tex -o output.md

The markdown file can be easily converted to Jupyter format with notedown:

$ notedown -o output.ipynb --nomagic input.md

General Advice

The recommendations given in the preceding sections were gathered based on some case studies of Bioconductor package vignette conversions to Jupyter Notebook format. As already pointed out, the process of conversion is limited by the pandoc's ability to handle user-defined environments and commands in LaTeX, therefore some preprocessing of TeX files is required prior to conversion with pandoc. Thus if you are using a custom-written LaTeX style file, you will have to check if all of your LaTeX commands and environments are rendered by pandoc.

Note that in pandoc there are a number of useful options that can help the vignette author avoid unnecessary work. For example, to get rid of the so called header identifiers, additional tags in braces appended to some section headers during TeX $\rightarrow$ Markdown conversion, use markdown_github flavor or the -header_attributes ("minus header attributes") extension with pandoc markdown, i.e.

$ pandoc --to=markdown_github+tex_math_dollars vignette.tex -o vignette.md
## convert to markdown GitHub "plus" TeX-style math

or

$ pandoc --to=markdown-header_attributes vignette.tex -o vignette.md
## convert to markdown "minus" header attriubtes

Note how extensions are explicitly added with '+' or removed with '-' and simply concatenated with the target format, in this case markdown_github and markdown, respectively.

To generate the table of contents with pandoc:

$ pandoc -s --toc --to=markdown-header_attributes vignette.tex -o vignette.md

The function render from the rmarkdown package uses pandoc as well. Hence pandoc extensions can be specified in md_document(variant = ...) function as discussed in section 3.1.

Current Limitations

Docker Idiosyncrasies

In theory, Docker is a great tool for cross-platform software distribution. However, currently there are some usability issues with Docker that have to be considered. On Mac OS and Windows Docker client is shipped with VirtualBox, which emulates a Linux kernel for the Docker engine. The Docker client and VirtualBox are not always perfectly synchronized and sometimes VirtualBox has to be manually activated or restarted in order to "kick-start" Docker on Windows or Mac OS.

Another unresolved issue is the inability to abort a pull or push in an intuitive manner (see issue #6928). Warning: pressing Ctrl + c does not interrupt an active pull or push process, but only terminates the Docker output, while the process is still lurking in the background. If you pressed Ctrl + C to abort a frozen pull process, you will have to restart the Docker daemon with a command specific to your OS to resume the pull (on Linux: sudo service docker restart; on Macs: docker-machine kill default).

R Kernel Issues

In R markdown vignettes the figure's width and height are set in hidden from user code chunks. These figure dimension parameters are important for keeping the unit aspect ratio and they can be extracted when processing the Rmd file. However, currently in IPython notebook these will have to be incroporated as explicit figure size declarations:

IRkernel::set_plot_options(width = 5, height = 4)

Summary and Conclusion

R markdown documentation can be easily converted to Jupyter notebook format. The conversion of TeX files is less trivial and an intermediate conversion to Markdown with pandoc is necessary. The generated Jupyter notebooks are packaged into Docker images and pushed to Docker Hub.

As mentioned in "Current Limitations" section, the user may experience some issues with Docker. The ultimate solution would be to provide a live notebook server similar to what rackspace did for Nature IPython notebook demonstration (link to the blog article here).



vladchimescu/RdocsJupyter documentation built on May 3, 2019, 6:15 p.m.