Using HTCondor for bootstrap analysis

Overview

Performing a bootstrap analysis with qtl2pleio requires first sampling a (bivariate) phenotype from the inferred bivariate normal distribution, then a two-dimensional, two-QTL QTL scan over the defined scan region. Often, one wants to include more than 100 markers in a scan region. Each two-dimensional scan can take tens of minutes with a single core. Thus, performing a bootstrap analysis with, say, 1000 bootstrap samples is prohibitively time-consuming on most computers. For this reason, we used a high-throughput computing cluster at the University of Wisconsin-Madison (where qtl2pleio was developed). The UW-Madison Center for High-Throughput Computing (CHTC) is available to all UW-Madison researchers. While we recognize that qtl2pleio users may not have access to the CHTC, we anticipate that some will have access to computing clusters. Hopefully this vignette can be adapted to your needs.

Directory structure, files & setup for using Condor

We placed on github a repository that contains the results from our bootstrap analysis of the Recla data.

Here is the repository's url: https://github.com/fboehm/qtl2pleio-manuscript-chtc/

Within this repo are three subdirectories. We'll examine in detail the "Recla-bootstrap" subdirectory.

Within this directory, we created four subdirectories:

  1. Rscript
  2. data
  3. shell_scripts
  4. submit_files

The Github repository also contains a fifth subdirectory, squid and a sixth subdirectory results, that are included for completeness. In practice, files in squid were in a distinct directory. We describe it below.

Files in Rscript

The Rscript subdirectory contains a file, boot-Recla-10-22.R that contains R code for our bootstrap analysis.

fn <- file.path("https://raw.githubusercontent.com", 
                "fboehm/qtl2pleio-manuscript/master/chtc", 
                "Recla-bootstrap/Rscript/boot-Recla-10-22.R"
                )
foo <- readLines(fn)

Here we display the contents of our boot-Recla-10-22.R file.

cat(foo, sep = "\n")

Files in data

The data subdirectory contains three rds files, one for each of the three input components.

  1. founder allele dosages for Chr 8
  2. phenotypes
  3. kinship matrix (derived via LOCO method)

Files in shell_scripts

We examine only the file that is needed for the Recla analysis, boot-Recla-10-22.sh.

Here is the contents of the file.

```{bash, eval = FALSE}

!/bin/bash

untar your R installation

tar -xzf R13.tar.gz tar -xzf SLIBS.tar.gz

make sure the script will use your R installation

export PATH=$(pwd)/R/bin:$PATH export LD_LIBRARY_PATH=$(pwd)/SS:$LD_LIBRARY_PATH

run R, with the name of your R script

R CMD BATCH '--args argname='$1' nsnp='$2' s1='$3' nboot_per_job='$4' run_num='$5'' Rscript/boot-Recla-10-22.R 'boot400_run'$5'-'$1'.Rout'

We need to unzip the R installation, which we've named R13.tar.gz, and SLIBS file on the remote computer where the actual computing will occur. Following that, and needed adjustments to the paths, we execute R. Note that we have, in this file, specified the variable values by numbers, like `$1` for `argname`. The values assigned to each of these is specified in the "submit" file.





### Files in `submit_files`

The HTCondor "submit" file provides instructions for controlling the interaction with the high-throughput resources. Below is the text in the submit file `boot-Recla-10-22.sub`

```{bash, eval = FALSE}
# hello-chtc.sub
# My very first HTCondor submit file
s1=650
# start scan at this SNP index value
nsnp=350
# length of scan in number of SNPs
# which run in the experimental design?
run_num=561
#
nboot_per_job=1
#  for almost all jobs), the desired name of the HTCondor log file,
#  and the desired name of the standard error file.  
#  Wherever you see $(Cluster), HTCondor will insert the queue number
#  assigned to this set of jobs at the time of submission.
universe = vanilla
log = boot-$(Process)-run$(run_num).log
error = boot-$(Process)-run$(run_num).err
#
# Specify your executable (single binary or a script that runs several
#  commands), arguments, and a files for HTCondor to store standard
#  output (or "screen output").
#  $(Process) will be a integer number for each job, starting with "0"
#  and increasing for the relevant number of jobs.
executable = ../shell_scripts/boot-Recla-10-22.sh
arguments = $(Process) $(nsnp) $(s1) $(nboot_per_job) $(run_num)
output = boot-$(Process)-run$(run_num).out
#
# Specify that HTCondor should transfer files to and from the
#  computer where each job runs. The last of these lines *would* be
when_to_transfer_output = ON_EXIT
transfer_input_files = ../data,../Rscript,../shell_scripts,http://proxy.chtc.wisc.edu/SQUID/fjboehm/R13.tar.gz,http://proxy.chtc.wisc.edu/SQUID/SLIBS.tar.gz
#
# Tell HTCondor what amount of compute resources
#  each job will need on the computer where it runs.
request_cpus = 1
request_memory = 3GB
request_disk = 1GB
#
requirements = (OpSysMajorVer == 6) || (OpSysMajorVer == 7)

# which computer grids to use:
+WantFlocking = true
+WantGlideIn = true

# Tell HTCondor to run instances of our job:
queue 1000

The ordering of the arguments on in the line that starts with arguments = dictates the numbers that are assigned to each argument. These numbers are used in the shell script, above.

Files on SQUID

The UW-Madison CHTC provides for each user disk space on a web proxy. More information about SQUID is available online.

In practice, we placed files containing founder allele dosages (for chromosomes of interest) and our (compressed) R installation on SQUID.

Submitting jobs to Condor (at UW-Madison CHTC)

We followed the usual procedure for submitting R jobs with the CHTC. To submit the file boot-Recla-10-22.sub, we first changed into its directory. We then typed

```{bash, eval = FALSE} condor_submit boot-Recla-10-22.sub

By using this command from within the `submit_files` subdirectory, we ensure that our outputted files are returned to the `submit_files` subdirectory. We then manually moved the outputted results files to the `results` subdirectory.

We found that up to ten percent of jobs would initially fail and require re-submission. There are a variety of reasons that may explain the need for re-submission. Because of this need, we include in the git repository the file `boot-Recla-10-22-fix.sub`, which gives the code that we used for resubmission. The file `bad-jobs-run561` gives the ids of the jobs that failed the initial runs.



## Session info

```r
devtools::session_info()


Try the qtl2pleio package in your browser

Any scripts or data that you put into this service are public.

qtl2pleio documentation built on Dec. 3, 2020, 1:06 a.m.