Running rc_kappa

Introduction

This is how to run the rc_kappa code. There are two ways to run it: in parallel on a single computer, or distributed on the cluster.

What's in the package

  1. Shell Scripts: R main scripts and shell scripts.
  2. Code.
  3. Science.

Running distributed on the cluster

There is some introductory information for using the cluster in run_ramp.md. For rc_kappa, in particular, there are three steps.

  1. Create a plan.
  2. Run workers to compute that plan.
  3. Assemble the results from the workers.

All three steps take the same command-line arguments. A single command, rc_cluster.R, sets up a distributed run. The first section below describes how to use it; the sections after that break it into parts to explain what happens inside that command.

Set up for a cluster

If you look in scripts/Makefile, you'll see a call like this one:

/ihme/singularity-images/rstudio/shells/execRscript.sh \
-i /ihme/singularity-images/rstudio/ihme_rstudio_4051.img \
-s rc_cluster.R --config=210908_world.toml   --outvars=210908_fewer \
--years=2019:2019 --draws=1000 --tasks=1000

This command's main() function does the following:

  1. Reads the given config file.
  2. Writes a plan file, which divides the problem for the cluster nodes using domain decomposition.
  3. Writes a shell script to do the work, which you can submit to the cluster using qsub. It builds this by substituting values into a template shell script.
  4. Writes another shell script that assembles the workers' results into the final output.

You can edit the shell scripts. If, for instance, there is less work than there are tasks, the assembly step will fail. In that case, reduce the number of cluster tasks to less than the length of the plan file (minus one) and run again, as sketched below.
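One way to do that, as a sketch: edit the SGE task-array directive in the generated worker shell script (the 50 here is only an illustration, standing in for your plan file's length minus one).

# Shrink the task array in the generated worker script to match the plan.
#$ -t 1-50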

Your job is to log in to the cluster, submit the first script, and note its job number. That job number has digits and then a period, like this: 234247929.1-1000. The part before the period is the job ID, and the rest is the task ID range. When you submit the assembly script, you can qsub it and append a flag to that qsub command to tell it to wait for the previous job ID.
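For example, a sketch of that sequence using SGE's -hold_jid flag, where rc_assemble.sh stands in for whatever name rc_cluster.R gave the assembly script:

qsub rc_worker.sh
# qsub prints the job-array ID, e.g. 234247929.1-1000.
# Hold the assembly until that job array finishes; it then runs on its own.
qsub -hold_jid 234247929 rc_assemble.sh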

Create a plan

The first script is rc_run_plan.R.

/ihme/singularity-images/rstudio/shells/execRscript.sh \
  -i /ihme/singularity-images/rstudio/ihme_rstudio_4030.img \
  -s rc_run_plan.R --config=rc_kappa.toml --outvars=201121_split100 \
  --years=2000:2019 --draws=100 --tasks=100

This step is quick to run, so I do it interactively. It creates two files in the outvars directory.

Run workers on that plan

The worker runs under SGE. The shell script to submit, with qsub rc_worker.sh, is set up to handle both the initial run and rerunning any tasks that failed to complete. In it, you'll see the choices below: the project, task range, runtime, and memory in the #$ directives, and the MISSING variable for reruns.

This worker took over an hour at 9 tiles per worker with 100 draws. Since the time scales roughly with tiles times draws, 1 tile with 1000 draws is a comparable amount of work and could take about 2 hours.

If I haven't run this script before with a given number of draws, I first run it with tasks -t 100-100 and record its job ID, in order to see how long it takes and how much memory it needs. These appear in qacct -j <job_id> as RSS (in KB) and wallclock time. An easier way is to run the command interactively, prefixed by /usr/bin/time --verbose and with --task=1 added at the end. This reports both the real time and the maximum resident set size, in kilobytes. Divide the latter by 1024^2 to get GB for the qsub command. It should be near 1-4 GB.
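For example, a timing run along these lines, reusing the arguments from the worker script below:

/usr/bin/time --verbose /ihme/singularity-images/rstudio/shells/execRscript.sh \
  -i /ihme/singularity-images/rstudio/ihme_rstudio_4030.img \
  -s rc_run_worker.R --config=rc_kappa.toml --outvars=201121_split100 \
  --years=2000:2019 --tasks=100 --draws=100 --task=1

The "Elapsed (wall clock) time" and "Maximum resident set size (kbytes)" lines in its output are the numbers to use.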

#!/bin/sh
#$ -P proj_mmc
#$ -cwd
#$ -j y
#$ -o /share/temp/sgeoutput/adolgert/rc_kappa_$JOB_ID_$TASK_ID.txt
#$ -N rc_split
#$ -l fthread=2
#$ -t 1-100
#$ -q all.q
#$ -l h_rt=3:00:00
#$ -l m_mem_free=4G
#$ -S /bin/bash

# If there were tasks that didn't run, paste their numbers here.
# This way you can submit one job that works through the missing ones.
# MISSING=40,44,47,49,50,52
MISSING=
if [[ -n "${MISSING}" ]]
then
  export SGE_TASK_ID=`echo $MISSING | cut -d"," -f"${SGE_TASK_ID}"`
fi

VERSION=201121_split100
/ihme/singularity-images/rstudio/shells/execRscript.sh \
  -i /ihme/singularity-images/rstudio/ihme_rstudio_4030.img \
  -s rc_run_worker.R --config=rc_kappa.toml --outvars=${VERSION} \
  --years=2000:2019 --tasks=100 --draws=100

The MISSING lines are there in case any tasks fail, which is a common problem on this cluster. The next step, which assembles the data, reports missing tasks on the command line. Copy those values here. Then, if there are 6 missing, change the task range to -t 1-6, and the script will run the missing tasks instead of tasks 1 through 6, as in the sketch below.
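For example, reusing the missing-task numbers from the comment in the script, these are the two lines to change in rc_worker.sh:

# Shrink the task array to the number of missing tasks (6 here).
#$ -t 1-6
# Paste the task numbers that the assembly step reported as missing.
MISSING=40,44,47,49,50,52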

These runs generate one HDF5 file for each worker, in the same outvars directory as above.

Assemble output from workers

Finally, we read all of the HDF5 files and create GeoTIFFs: one for each variable, for the median and each confidence interval, for each year.

The rc_run_assemble.R script takes the same arguments as those above. If this script doesn't find all of its inputs, it lists the ones that are missing and prints the missing tasks in a format suitable for the MISSING variable in the qsub script above. If you know some jobs failed, running this is an easy way to find the failures.

This script needs about 3 GB of memory to read data from 100 files. It works one variable at a time, so memory usage stays low, and it would be possible to run it in parallel over the variables. It currently takes about 20 minutes.

/ihme/singularity-images/rstudio/shells/execRscript.sh \
  -i /ihme/singularity-images/rstudio/ihme_rstudio_4030.img \
  -s rc_run_assemble.R --config=rc_kappa.toml --outvars=201121_split100 \
  --years=2000:2019 --draws=100 --tasks=100

The assembly can also run in parallel with GNU parallel. The task ID says which variable each parallel worker will save. There are currently 18 variables, though that may change.

parallel Rscript scripts/rc_run_assemble.R --config=scripts/sam_mean.toml \
  --outvars=201124_africa_mean --years=2017:2017 --draws=100 --tasks=217 \
  --overwrite --task={} ::: {1..18}

