distributed_computing: Run any R function on a remote HPC system with SLURM

View source: R/toolsSeverin.R


Run any R function on a remote HPC system with SLURM

Description

Generates R and bash scripts and transfers them to the remote machine via SSH. 'sshpass' needs to be installed on your local machine to circumvent manual password entry.

Usage

distributed_computing(
  ...,
  jobname,
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "01:00:00",
  ssh_passwd = NULL,
  machine = "cluster",
  var_values = NULL,
  no_rep = NULL,
  recover = T,
  purge_local = F,
  compile = F,
  custom_folders = NULL,
  resetSeeds = TRUE
)

Arguments

...

R code to be executed remotely. Parameters that change between runs must be named var_i; see "Details".

jobname

Character string naming the run. Must be unique; results of an existing run with the same name will be overwritten. Must not contain the string "Minus".

partition

Partition used by the SLURM manager. Default is "single". Should only be changed if explicitly needed.

cores

Number of cores per node to be used. If set to a value with 16 < value < 25, the number of eligible nodes is severely limited: 16 cores or fewer are available on all nodes, more than 24 on none.

nodes

Nodes per task. Default is 1 and should not be changed, since distributed_computing() is set up to be controlled by the number of repetitions.

walltime

Estimated runtime in the format hh:mm:ss; the default is "01:00:00" (one hour). Jobs will be canceled after the defined time.

ssh_passwd

To be set when sshpass should be used to authenticate on the remote machine automatically via password. This is an obvious security nightmare...

machine

SSH address in the form user@remote_location.

var_values

List of parameter arrays. The number of arrays (i.e. list entries) must correspond to the number of parameters in the function passed to .... These parameters must be named var_i, where i is replaced by the index of the corresponding array in var_values. The length of the arrays defines the number of nodes used, with the j-th node using the j-th entry of each array for the corresponding var_i. If no_rep is used, var_values must be set to NULL.

no_rep

Number of repetitions. This parameter and var_values are mutually exclusive. When used, the function passed to ... is executed no_rep times simultaneously, using one node per realization.

recover

Logical. If set to TRUE, nothing is calculated; the functions check(), get() and purge() can be used on results generated previously under the same jobname.

purge_local

Logical, if set to TRUE the purge() function also removes local files.

compile

Logical. If set to TRUE, the source files are transferred to the remote machine and compiled there. If set to FALSE, the locally compiled shared objects are transferred and used instead.

custom_folders

Named vector with exactly three entries named 'compiled', 'output' and 'tmp'. The values are strings with relative paths from the current working directory: 'compiled' points to the directory of the compiled files, 'tmp' to the temporary folder from which files are copied to the cluster, and 'output' to the folder in which the calculated results from the cluster are saved. The default is NULL, in which case everything is done from the current working directory. If only a subset of the folders should be changed, all others need to be set to "./" (see the sketch after this argument list).

resetSeeds

Logical. If set to TRUE (default), the vector of random seeds .Random.seed in the transferred workspace is deleted on the remote machine. This ensures that each node uses a different set of (pseudo) random numbers. Set to FALSE at your own risk.
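
As an illustration of the custom_folders format, here is a minimal, hypothetical layout (the folder names are made up for this sketch); note that all three entries must be present, with unchanged locations set to "./":

# Hedged sketch: compiled files and results live in subfolders,
# transfers are staged from the working directory itself.
folders <- c(
  compiled = "./compiled",
  output = "./results",
  tmp = "./"
)
# then: distributed_computing(..., custom_folders = folders)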

Details

distributed_computing() generates R and bash scripts designed to run on an HPC system managed by the SLURM batch manager. The current workspace, together with the scripts, is exported and transferred to the remote machine via SSH. If ssh-key authentication is not possible, the SSH password can be passed via ssh_passwd and is then used by sshpass (which has to be installed on the local machine).

The code to be executed remotely is passed via the ... argument; its final output is stored in the object cluster_result, which is loaded into the local workspace by the get() function.

It is possible either to run repetitions of the same program realization (via the no_rep parameter) or to pass a list of parameter arrays via var_values. The parameters to be changed for each run *must* be named var_i, where i corresponds to the i-th array in the var_values parameter.
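
As a minimal sketch of this mechanism (the code body and machine address are hypothetical): two arrays of length three start three nodes, and on the j-th node var_1 and var_2 take the j-th entries of the respective arrays.

sweep_job <- distributed_computing(
  {
    # var_1 and var_2 are filled in per node; illustrative payload only
    seq(as.numeric(var_1), as.numeric(var_2), length.out = 10)
  },
  jobname = "var_values_sketch",
  machine = "user@cluster",
  var_values = list(
    c(0, 1, 2),  # node j sees var_1 = c(0, 1, 2)[j]
    c(5, 6, 7)   # node j sees var_2 = c(5, 6, 7)[j]
  ),
  no_rep = NULL,
  recover = FALSE
)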

Value

List of functions check(), get() and purge(). check() checks whether the results are ready. get() copies all files from the remote working directory to the local one and loads all results present (even if not all nodes were done) into the currently active workspace as the object cluster_result, a list with the result of each node as an entry. purge() deletes the temporary folder on the remote machine and, if purge_local is set to TRUE, the local temporary files as well.
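
A hedged sketch of working with the returned handle, assuming check() reports completion in a way that isTRUE() can test (the waiting loop and payload are illustrative, not part of the package):

job <- distributed_computing(
  { Sys.time() },  # hypothetical trivial payload
  jobname = "handle_sketch",
  machine = "user@cluster",
  no_rep = 2,
  recover = FALSE
)
repeat {
  if (isTRUE(job$check())) break  # assumed: check() returns TRUE when done
  Sys.sleep(60)
}
job$get()    # loads cluster_result into the workspace
job$purge()  # removes the remote temporary folder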

Examples

## Not run: 
out_distributed_computing <- distributed_computing(
  {
    mstrust(
      objfun = objective_function,
      center = outer_pars,
      studyname = "study",
      rinit = 1,
      rmax = 10,
      fits = 48,
      cores = 16,
      iterlim = 700,
      sd = 4
    )
  },
  jobname = "my_name",
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "02:00:00",
  ssh_passwd = "password",
  machine = "cluster",
  var_values = NULL,
  no_rep = 20,
  recover = FALSE,
  compile = FALSE
)
out_distributed_computing$check()
out_distributed_computing$get()
out_distributed_computing$purge()
result <- cluster_result
print(result)


# calculate profiles
var_list <- profile_pars_per_node(best_fit, 4)
profile_jobname <- paste0(fit_filename, "_profiles_opt")
method <- "optimize"
profiles_distributed_computing <- distributed_computing(
  {
    profile(
      obj = obj,
pars = best_fit,
      whichPar = (as.numeric(var_1):as.numeric(var_2)),
      limits = c(-5, 5),
      cores = 16,
      method = method,
      stepControl = list(
        stepsize = 1e-6,
        min = 1e-4, 
        max = Inf, 
        atol = 1e-2,
        rtol = 1e-2, 
        limit = 100
      ),
      optControl = list(iterlim = 20)
    )
  },
  jobname = profile_jobname,
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "02:00:00",
  ssh_passwd = "password",
  machine = "cluster",
  var_values = var_list,
  no_rep = NULL,
  recover = FALSE,
  compile = FALSE
)
profiles_distributed_computing$check()
profiles_distributed_computing$get()
profiles_distributed_computing$purge()
# combine the per-node profile results
profiles <- do.call(rbind, cluster_result)

## End(Not run)

