distributed_computing: Run any R function on a remote HPC system with SLURM

View source: R/toolsSeverin.R


Run any R function on a remote HPC system with SLURM

Description

Generates R and bash scripts and transfers them to the remote machine via SSH. 'sshpass' needs to be installed on your local machine to circumvent manual password entry.

Usage

distributed_computing(
  ...,
  jobname,
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "01:00:00",
  ssh_passwd = NULL,
  machine = "cluster",
  var_values = NULL,
  no_rep = NULL,
  recover = T,
  purge_local = F,
  compile = F,
  custom_folders = NULL,
  resetSeeds = TRUE
)

Arguments

...

R code to be executed remotely. Parameters that change between runs must be named var_i; see "Details".

jobname

Character string naming the run. Must be unique; results of an existing run with the same name will be overwritten. Must not contain the string "Minus".

partition

Partition used by the SLURM manager. Default is "single". Should only be changed if explicitly needed.

cores

Number of cores per node to be used. If set to a value with 16 < value < 25, the number of eligible nodes is severely limited: 16 cores or fewer are available on all nodes, more than 24 on none.

nodes

Nodes per task. Default is 1 and should not be changed, since distributed_computing() is set up to be controlled by the number of repetitions.

walltime

Estimated runtime in the format hh:mm:ss; the default is "01:00:00" (one hour). Jobs will be canceled after the defined time.

ssh_passwd

To be set when sshpass should be used to authenticate on the remote machine automatically via password. This is an obvious security nightmare...

machine

SSH address in the form user@remote_location.

var_values

List of parameter arrays. The number of arrays (i.e. list entries) must correspond to the number of parameters in the function passed to .... These parameters must be named var_i, where i is replaced by the index of the corresponding array in var_values. The length of the arrays defines the number of nodes used, with the j-th node using the j-th entry of each array for the corresponding var_i. If no_rep is used, var_values must be set to NULL.

no_rep

Number of repetitions. This parameter and var_values are mutually exclusive. When used, the function passed to ... is executed no_rep times simultaneously, using one node per realization.

recover

Logical. If set to TRUE, nothing is calculated; the functions check(), get() and purge() can be used on results generated previously under the same jobname.

purge_local

Logical, if set to TRUE the purge() function also removes local files.

compile

Logical. If set to TRUE, the source files are transferred to the remote machine and compiled there. If set to FALSE, the locally compiled shared objects are transferred and used instead.

custom_folders

Named vector with exactly three entries named 'compiled', 'output' and 'tmp'. The values are strings with relative paths from the current working directory: 'compiled' points to the directory of the compiled files, 'tmp' to the temporary folder from which files are copied to the cluster, and 'output' to the folder in which the calculated results from the cluster are saved. The default is NULL, in which case everything is done from the current working directory. If only a subset of the folders should be changed, all others need to be set to "./" (see the sketch after this argument list).

resetSeeds

Logical. If set to TRUE (default), the vector of random seeds .Random.seed in the transferred workspace is deleted on the remote machine. This ensures that each node uses a different set of (pseudo) random numbers. Set to FALSE at your own risk.
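
As an illustration of the custom_folders format, here is a minimal, hypothetical layout (the folder names are made up for this sketch); note that all three entries must be present, with unchanged locations set to "./":

# Hedged sketch: compiled files and results live in subfolders,
# transfers are staged from the working directory itself.
folders <- c(
  compiled = "./compiled",
  output = "./results",
  tmp = "./"
)
# then: distributed_computing(..., custom_folders = folders)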

Details

distributed_computing() generates R and bash scripts designed to run on an HPC system managed by the SLURM batch manager. The current workspace, together with the scripts, is exported and transferred to the remote machine via SSH. If ssh-key authentication is not possible, the SSH password can be passed via ssh_passwd and is then used by sshpass (which has to be installed on the local machine).

The code to be executed remotely is passed via the ... argument; its final output is stored in the object cluster_result, which is loaded into the local workspace by the get() function.

It is possible either to run repetitions of the same program realization (via the no_rep parameter) or to pass a list of parameter arrays via var_values. The parameters to be changed for each run *must* be named var_i, where i corresponds to the i-th array in the var_values parameter.
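
As a minimal sketch of this mechanism (the code body and machine address are hypothetical): two arrays of length three start three nodes, and on the j-th node var_1 and var_2 take the j-th entries of the respective arrays.

sweep_job <- distributed_computing(
  {
    # var_1 and var_2 are filled in per node; illustrative payload only
    seq(as.numeric(var_1), as.numeric(var_2), length.out = 10)
  },
  jobname = "var_values_sketch",
  machine = "user@cluster",
  var_values = list(
    c(0, 1, 2),  # node j sees var_1 = c(0, 1, 2)[j]
    c(5, 6, 7)   # node j sees var_2 = c(5, 6, 7)[j]
  ),
  no_rep = NULL,
  recover = FALSE
)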

Value

List of functions check(), get() and purge(). check() checks whether the results are ready. get() copies all files from the remote working directory to the local one and loads all results present (even if not all nodes were done) into the currently active workspace as the object cluster_result, a list with the result of each node as an entry. purge() deletes the temporary folder on the remote machine and, if purge_local is set to TRUE, the local temporary files as well.
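
A hedged sketch of working with the returned handle, assuming check() reports completion in a way that isTRUE() can test (the waiting loop and payload are illustrative, not part of the package):

job <- distributed_computing(
  { Sys.time() },  # hypothetical trivial payload
  jobname = "handle_sketch",
  machine = "user@cluster",
  no_rep = 2,
  recover = FALSE
)
repeat {
  if (isTRUE(job$check())) break  # assumed: check() returns TRUE when done
  Sys.sleep(60)
}
job$get()    # loads cluster_result into the workspace
job$purge()  # removes the remote temporary folder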

Examples

## Not run: 
out_distributed_computing <- distributed_computing(
  {
    mstrust(
      objfun = objective_function,
      center = outer_pars,
      studyname = "study",
      rinit = 1,
      rmax = 10,
      fits = 48,
      cores = 16,
      iterlim = 700,
      sd = 4
    )
  },
  jobname = "my_name",
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "02:00:00",
  ssh_passwd = "password",
  machine = "cluster",
  var_values = NULL,
  no_rep = 20,
  recover = FALSE,
  compile = FALSE
)
out_distributed_computing$check()
out_distributed_computing$get()
out_distributed_computing$purge()
result <- cluster_result
print(result)


# calculate profiles
var_list <- profile_pars_per_node(best_fit, 4)
profile_jobname <- paste0(fit_filename, "_profiles_opt")
method <- "optimize"
profiles_distributed_computing <- distributed_computing(
  {
    profile(
      obj = obj,
pars = best_fit,
      whichPar = (as.numeric(var_1):as.numeric(var_2)),
      limits = c(-5, 5),
      cores = 16,
      method = method,
      stepControl = list(
        stepsize = 1e-6,
        min = 1e-4, 
        max = Inf, 
        atol = 1e-2,
        rtol = 1e-2, 
        limit = 100
      ),
      optControl = list(iterlim = 20)
    )
  },
  jobname = profile_jobname,
  partition = "single",
  cores = 16,
  nodes = 1,
  walltime = "02:00:00",
  ssh_passwd = "password",
  machine = "cluster",
  var_values = var_list,
  no_rep = NULL,
  recover = FALSE,
  compile = FALSE
)
profiles_distributed_computing$check()
profiles_distributed_computing$get()
profiles_distributed_computing$purge()
# combine the per-node profile results
profiles <- do.call(rbind, cluster_result)

## End(Not run)

