
Package Management

The doAzureParallel package allows you to install packages to your pool in two ways:

- Installing on pool creation
- Installing per-foreach loop

Packages installed at the pool level only need to be installed once per node. Each iteration of the foreach loop can then load them without installing them again. Packages installed in the foreach loop let you specify dependencies that are required only for that particular loop.

Installing Packages on Pool Creation

Pool-level packages can come from CRAN, GitHub, and Bioconductor. They are installed in a shared directory on the node. Packages installed at the pool level are not loaded automatically: inside the foreach loop you must either load them explicitly or list them in the .packages parameter (or the github or bioconductor parameter for GitHub or Bioconductor packages). For example, if you installed xml2 on the cluster, you must explicitly load it or add it to .packages before using it.

foreach (i = 1:4) %dopar% {
  # Load the libraries you want to use.
  library(xml2)
  xml2::as_list(...)
}

or

foreach (i = 1:4, .packages=c('xml2')) %dopar% {
  xml2::as_list(...)
}

You can install packages by specifying the package(s) in your JSON pool configuration file. This will then install the specified packages at the time of pool creation.

{
  ...
  "rPackages": {
    "cran": ["some_cran_package_name", "some_other_cran_package_name"],
    "github": ["github_username/github_package_name", "another_github_username/another_github_package_name"],
    "bioconductor": ["IRanges"]
  },
  ...
}
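Before creating a pool, it can help to check the package list for mistakes. The following is a minimal sketch in base R, assuming the rPackages section is represented as an R list with the same three keys as the JSON above. The function validate_r_packages is a hypothetical helper for illustration and is not part of doAzureParallel.

```r
# Hypothetical helper (not part of doAzureParallel): check that a list
# mirroring the "rPackages" JSON section uses only the documented keys
# and that every entry is a character vector of package names.
validate_r_packages <- function(r_packages) {
  allowed <- c("cran", "github", "bioconductor")
  unknown <- setdiff(names(r_packages), allowed)
  if (length(unknown) > 0) {
    stop("Unknown rPackages keys: ", paste(unknown, collapse = ", "))
  }
  for (key in names(r_packages)) {
    if (!is.character(r_packages[[key]])) {
      stop("rPackages$", key, " must be a character vector")
    }
  }
  invisible(TRUE)
}

# Example input with placeholder package names taken from the JSON above.
validate_r_packages(list(
  cran = c("some_cran_package_name", "some_other_cran_package_name"),
  github = "github_username/github_package_name",
  bioconductor = "IRanges"
))
```

The check only covers the structure of the list; it does not verify that the named packages exist.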

Installing Packages per-foreach Loop

You can also install CRAN packages with the .packages option of the foreach loop, and GitHub or Bioconductor packages with the github and bioconductor options. Instead of being installed during pool creation, the packages (and their dependencies) are installed before each iteration of the loop runs on your Azure cluster.

Installing a GitHub Package

doAzureParallel supports GitHub packages through the github option.

Do not prefix the GitHub package name with "https://github.com/". Use the "username/package_name" form instead.
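If package names are collected from URLs, the prefix can be stripped before they are passed to the github option. This is a minimal sketch in base R; normalize_github_ref is a hypothetical helper and is not provided by doAzureParallel.

```r
# Hypothetical helper (not part of doAzureParallel): reduce a GitHub URL
# to the "username/package_name" form expected by the github option.
normalize_github_ref <- function(ref) {
  ref <- sub("^https?://github\\.com/", "", ref)  # drop the URL prefix if present
  ref <- sub("/+$", "", ref)                       # drop any trailing slash
  ref <- sub("\\.git$", "", ref)                   # drop a trailing ".git"
  ref
}

# Values already in the short form pass through unchanged.
github_packages <- normalize_github_ref(c(
  "https://github.com/Azure/rAzureBatch",
  "Azure/doAzureParallel"
))
print(github_packages)
```

The resulting character vector can then be supplied as the github argument of foreach.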

Installing packages from a private GitHub repository

Clusters can be configured to install packages from a private GitHub repository by setting the githubAuthenticationToken property in the credentials file. If this property is blank, only public repositories can be used. If a token is provided, both public and private GitHub repositories can be used.

When the cluster is created, the token is passed in as an environment variable called GITHUB_PAT at start-up. The variable lasts for the life of the cluster and is looked up whenever devtools::install_github is called.
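To confirm that a token is visible to an R session, the environment variable can be inspected directly. This is a minimal sketch in base R; github_pat_is_set is a hypothetical helper name, and the sketch only checks that the variable is non-empty, not that the token is valid.

```r
# Hypothetical helper: report whether a GitHub token is visible to the
# current R session through the GITHUB_PAT environment variable.
github_pat_is_set <- function() {
  nzchar(Sys.getenv("GITHUB_PAT", unset = ""))
}

if (!github_pat_is_set()) {
  message("GITHUB_PAT is not set; only public repositories can be installed.")
}
```

Running the same check inside a foreach body would show whether the variable is present on the node that executes the task.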

Credentials File for github authentication token

{
  ...
  "githubAuthenticationToken": "",
  ...
}

Cluster File

{
    ...
    "rPackages": {
        "cran": [],
        "github": ["<project/some_private_repository>"],
        "bioconductor": []
    },
    "commandLine": []
}

More information regarding GitHub authentication tokens can be found here.

Installing Multiple Packages

To install multiple packages, pass character vectors of package names to the .packages, github, and bioconductor options:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations,
                  .packages=c('package_1', 'package_2'),
                  github = c('Azure/rAzureBatch', 'Azure/doAzureParallel'),
                  bioconductor = c('IRanges', 'Biobase')) %dopar% { ... }

To install a single CRAN package:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages='some_package') %dopar% { ... }

To install multiple CRAN packages:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, .packages=c('package_1', 'package_2')) %dopar% { ... }

To install a single GitHub package:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, github='Azure/rAzureBatch') %dopar% { ... }

To install multiple GitHub packages:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, github=c('username_1/package_1', 'username_2/package_2')) %dopar% { ... }

To install a single Bioconductor package:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, bioconductor='some_package') %dopar% { ... }

To install multiple Bioconductor packages:

number_of_iterations <- 10
results <- foreach(i = 1:number_of_iterations, bioconductor=c('package_1', 'package_2')) %dopar% { ... }

Installing a Bioconductor Package

The default deployment of R used in the cluster (see Customizing the cluster for more information) includes the Bioconductor installer. To install Bioconductor packages on the cluster, add their names to the bioconductor array in the cluster configuration file.

{
    "name": <your pool name>,
    "vmSize": <your pool VM size name>,
    "maxTasksPerNode": <num tasks to allocate to each node>,
    "poolSize": {
        "dedicatedNodes": {
            "min": 2,
            "max": 2
        },
        "lowPriorityNodes": {
            "min": 1,
            "max": 10
        },
        "autoscaleFormula": "QUEUE"
    },
    "containerImage": "rocker/tidyverse:latest",
    "rPackages": {
        "cran": [],
        "github": [],
        "bioconductor": ["IRanges"]
    },
    "commandLine": [],
    "subnetId": ""
}

Note: Container images other than the tidyverse image do not support Bioconductor installs out of the box. If you choose another container image, you must make sure that Bioconductor is installed in it.
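One way to check a custom container image is to test, from an R session inside it, whether a Bioconductor installer package can be loaded. This is a minimal sketch in base R under the assumption that the image provides either BiocManager or the older BiocInstaller package; bioc_installer_available is a hypothetical helper, and which installer doAzureParallel itself relies on is not stated here.

```r
# Hypothetical helper: TRUE if a Bioconductor installer package can be
# loaded in this R session.
# Assumption: either BiocManager (current) or BiocInstaller (older
# releases) counts as "Bioconductor is installed".
bioc_installer_available <- function() {
  candidates <- c("BiocManager", "BiocInstaller")
  any(vapply(candidates, requireNamespace, logical(1), quietly = TRUE))
}

if (!bioc_installer_available()) {
  message("No Bioconductor installer found in this R installation.")
}
```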

Installing Custom Packages

doAzureParallel supports custom package installation in the cluster. Installing custom packages at the per-foreach loop level is not supported.

Steps for installing custom packages can be found here.

Note: If the package requires system-level dependencies, such as libraries installed with apt-get, users will be required to build their own containers.

Uninstalling a Package

Uninstalling packages from your pool is not supported. However, you may consider rebuilding your pool.



Azure/doAzureParallel documentation built on May 22, 2021, 4:39 a.m.