Freezing Python versions inside Bioconductor packages

knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(basilisk)
library(BiocStyle)

Why?

Packages like r CRANpkg("reticulate") facilitate the use of Python modules in our R-based data analyses, allowing us to leverage Python's strengths in fields such as machine learning and image analysis. However, it is notoriously difficult to ensure that a consistent version of Python is available with a consistently versioned set of modules, especially when the system installation of Python is used. As a result, we cannot easily guarantee that some Python code executed via r CRANpkg("reticulate") on one computer will yield the same results as the same code run on another computer. It is also possible that two R packages depend on incompatible versions of Python modules, such that it is impossible to use both packages at the same time. These versioning issues represent a major obstacle to reliable execution of Python code across a variety of systems via R/Bioconductor packages.

What?

r Biocpkg("basilisk") uses the conda installers to provision a Python instance that is fully managed by the Bioconductor installation machinery. This provides developers of downstream Bioconductor packages with more control over their Python environment, most typically by the creation of package-specific conda environments containing all of their required Python packages. Additionally, r Biocpkg("basilisk") provides utilities to manage different Python environments within a single R session, enabling multiple Bioconductor packages to use incompatible versions of Python packages in the course of a single analysis. These features enable reproducible analysis, simplify debugging of code and improve interoperability between compliant packages.

How?

Overview

The son.of.basilisk package (provided in the inst/example directory of this package) is provided as an example of how one might write a client package that depends on r Biocpkg("basilisk"). This is a fully operational example package that can be installed and run, so prospective developers should use it as a template for their own packages. We will assume that readers are familiar with general R package development practices and will limit our discussion to the r Biocpkg("basilisk")-specific elements.

Setting up the package

StagedInstall: no should be set, to ensure that Python packages are installed with the correct hard-coded paths within the R package installation directory.

Imports: basilisk should be set along with appropriate directives in the NAMESPACE for all r Biocpkg("basilisk") functions that are used.

.BBSoptions should contain UnsupportedPlatforms: win32, as builds on Windows 32-bit are not supported.

Creating environments

Typically, we need to create some environments with the requisite packages using the conda package manager. A basilisk.R file should be present in the R/ subdirectory containing commands to produce a BasiliskEnvironment object. These objects define the Python environments to be constructed by r Biocpkg("basilisk") on behalf of your client package.

my_env <- BasiliskEnvironment(envname="my_env_name",
    pkgname="ClientPackage",
    packages=c("pandas==0.25.1", "sklearn==0.22.1")
)

second_env <- BasiliskEnvironment(envname="second_env_name",
    pkgname="ClientPackage",
    packages=c("scipy=1.4.1", "numpy==1.17") 
)

As shown above, all listed Python packages should have valid version numbers that can be obtained by conda. It is strongly recommended to explicitly list the versions of any dependencies so as to future-proof the installation process. We suggest using listPackages() on the to identify the appropriate versions, see ?setupBasiliskEnv for details. It is also possible to install packages from PyPi via the pip= argument, but this should be done with much caution.

An executable configure file should be created in the top level of the client package, containing the command shown below. This enables creation of environments during r Biocpkg("basilisk") installation if BASILISK_USE_SYSTEM_DIR is set.

#!/bin/sh

${R_HOME}/bin/Rscript -e "basilisk::configureBasiliskEnv()"

For completeness, configure.win should also be created:

#!/bin/sh

${R_HOME}/bin${R_ARCH_BIN}/Rscript.exe -e "basilisk::configureBasiliskEnv()"

Using the environments

Any R functions that use Python code should do so via basiliskRun(), which ensures that different Bioconductor packages play nice when their dependencies clash. To use methods from the my_env environment that we previously defined, the functions in our hypothetical ClientPackage package should define functions like:

my_example_function <- function(ARG_VALUE_1, ARG_VALUE_2) { 
    proc <- basiliskStart(my_env)
    on.exit(basiliskStop(proc))

    some_useful_thing <- basiliskRun(proc, function(arg1, arg2) {
        mod <- reticulate::import("scikit-learn")
        output <- mod$some_calculation(arg1, arg2)

        # The return value MUST be a pure R object, i.e., no reticulate
        # Python objects, no pointers to shared memory. 
        output 
    }, arg1=ARG_VALUE_1, arg2=ARG_VALUE_2)

    some_useful_thing
}

The return value of basiliskRun() must be a pure R object. Developers should NOT return r CRANpkg("reticulate") bindings to Python objects or any other pointers to external memory (e.g., file handles). This is because basiliskRun() may execute in a different process such that any pointers are no longer valid when they are transferred back to the parent process. Both the arguments to the function passed to basiliskRun() and its return value MUST be amenable to serialization. Developers are encouraged to check that their function behaves correctly in a different process by setting setBasiliskShared(FALSE) and setBasiliskFork(FALSE) prior to testing.

Note that basiliskStart() will lazily install conda and the required environments if they are not already present. This can result in some delays on the first time that any function using r Biocpkg("basilisk") is called; after that, the installed content will simply be re-used.

It is probably unwise to use proc across user-visible functions, i.e., the end-user should never have an opportunity to interact with proc.

Additional notes

Important restrictions

Modifying environment variables

All environment variables described here must be set at both installation time and run time to have any effect. If any value is changed, it is generally safest to reinstall r Biocpkg("basilisk") and all of its clients.

Setting the BASILISK_EXTERNAL_DIR environment variable will change where the conda instance and environments are placed by basiliskStart() during lazy installation. This is usually unnecessary unless the default path generated by r CRANpkg("rappdirs") contains spaces or the combination of the default location and conda's directory structure exceeds the file path length limit on Windows.

Setting BASILISK_USE_SYSTEM_DIR to 1 will instruct r Biocpkg("basilisk") to install the conda instance in the R system directory during R package installation. Similarly, all (correctly configured) client packages will install their environments in the corresponding system directory when they themselves are being installed. This is very useful for enterprise-level deployments as the conda instances and environments are (i) not duplicated in each user's home directory, and (ii) always available to any user with access to the R installation. However, it requires installation from source and thus is not set by default.

It is possible to direct r Biocpkg("basilisk") to use an existing Miniconda or Anaconda instance by setting the BASILISK_EXTERNAL_CONDA environment variable to an absolute path to the installation directory. This may be desirable to avoid redundant copies of the same installation. However, in such cases, it is the user's responsibility to administer their conda instance to ensure that the correct versions of all packages are available for r Biocpkg("basilisk")'s clients.

Setting BASILISK_NO_DESTROY to 1 will instruct r Biocpkg("basilisk") to not destroy previous conda instances and environments upon installation of a new version of r Biocpkg("basilisk"). This destruction is done by default to avoid accumulating many large obsolete conda instances. However, it is not desirable if there are multiple R instances running different versions of r Biocpkg("basilisk") from the same Bioconductor release, as installation by one R instance would delete the installed content for the other. (Multiple R instances running different Bioconductor releases are not affected.) This option has no effect if BASILISK_USE_SYSTEM_DIR is set.

Custom environment creation

Given a path, basiliskStart() can operate on any conda or virtual environment, not just those constructed by r Biocpkg("basilisk"). It is possible to manually set up the desired environment with, e.g., Python 2.7, or to manually modify an existing environment to include custom Python packages not in PyPi or Conda. This allows users to continue to take advantage of r Biocpkg("basilisk")'s graceful handling of Python collisions in more complex applications. However, it can be quite difficult to guarantee portability so one should try to avoid doing this.

Session information

sessionInfo()


Try the basilisk package in your browser

Any scripts or data that you put into this service are public.

basilisk documentation built on Dec. 18, 2020, 2 a.m.